CRAN Analysis

1 Introduction

In this section, we address our research questions using the data we have collected for the R programming language. Following the general EDA, we characterize R packages by sector, organization/institution, and country, and attribute credit to the most influential actors by aggregating over these variables. We also construct a package network to see how packages are linked to one another. Finally, we use a number of impact measures (e.g. additions, reverse dependencies) to identify the most important packages in the R community. Some impact measures (e.g. stars, forks) are available only for the packages for which we were able to collect GitHub data.

Code
library(tidyverse)
library(RMySQL)
library(ggwes)
library(knitr)
library(kableExtra)
library(pander)
library(ggthemes)
library(readxl)

2 File List

2.1 Input Files

  • cran: Full CRAN Database as of September 2023 with selected metadata

  • cran_repos: CRAN GitHub repos loaded from database containing repository metrics

  • cran_users: CRAN GitHub Users data loaded from database containing information like sector/organization

  • user_commits: CRAN GitHub user commit data containing additions and deletions

  • user_countries: Cleaned CRAN GitHub user country data

Code
cran <- read.csv("\\\\westat.com\\DFS\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\cran.csv")%>%
                  dplyr::select(-X)

cran_repos <- read.csv("\\\\westat.com\\DFS\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\cran_repos.csv")%>%
                    dplyr::select(-X)

cran_users <- read_excel("\\\\westat.com\\DFS\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\cran_users.xlsx")
cran_users <- cran_users[,-1]

user_commits <- read.csv("\\\\westat.com\\DFS\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\cran_user_commits.csv")%>%
                    dplyr::select(-X) 

user_countries <- read.csv("\\\\westat.com\\DFS\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\cran_user_countries.csv")%>%
                    dplyr::select(-X)
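The paths above are tied to one network share. A minimal sketch of a more portable loader follows, assuming a hypothetical environment variable `PAPER_DATA_DIR` and a hypothetical helper `read_paper_csv()`; neither is part of the original analysis. It uses base R only and drops the `X` row-name column that `write.csv()` produces by default, mirroring the `select(-X)` calls above.

```r
# Hypothetical: PAPER_DATA_DIR is an assumed environment variable, not part of the original setup.
data_dir <- Sys.getenv("PAPER_DATA_DIR", unset = tempdir())

# Hypothetical helper: read a CSV from a directory and drop the "X" row-name column if present.
read_paper_csv <- function(name, dir = data_dir) {
  df <- read.csv(file.path(dir, name), stringsAsFactors = FALSE)
  df[, setdiff(names(df), "X"), drop = FALSE]
}

# Self-contained demonstration with a throwaway file (made-up rows).
demo_dir <- tempfile("paper_data_")
dir.create(demo_dir)
write.csv(
  data.frame(Package = c("pkgA", "pkgB"), Sector = c("Academic", "Unknown")),
  file.path(demo_dir, "cran.csv")
)
cran_demo <- read_paper_csv("cran.csv", dir = demo_dir)
print(names(cran_demo))
```

Keeping the directory in one place means the five reads differ only in the file name.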

3 Analysis: R Programming Language

3.1 Characterizing the Open Source Software Ecosystem

3.1.1 All CRAN Packages (As of September 2023)

3.1.1.1 What is the distribution of R Packages and R Package maintainers by sector?

Out of 19,852 packages, we were not able to identify a sector for 12,721. Of the 7,131 packages where a sector was found, 6,240 were identified as academic, 583 as business, 166 as government, and 142 as nonprofit.

Code
## sectors based on packages
pander(table(cran$Sector, useNA = "always"))
Academic Business Government Nonprofit Unknown NA
6240 583 166 142 12721 0

Out of 10,821 unique maintainers, we were able to identify a sector for 4,014: 3,639 are from the academic sector, 196 from business, 87 from government, and 92 from nonprofit.

Code
## sectors based on unique maintainers
cran_unique <- cran %>%
                    distinct(email, .keep_all = T)

pander(table(cran_unique$Sector, useNA = "always"))
Academic Business Government Nonprofit Unknown NA
3639 196 87 92 6807 0

Among the CRAN packages for which we were able to identify a sector, 88% are academic, 8% are business, 2% are government, and 2% are nonprofit. Among unique maintainers with an identified sector, 91% are academic, 5% are business, 2% are government, and 2% are nonprofit.
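As a quick check on these percentages, the sketch below recomputes the shares from the counts reported in the two sector tables above, with Unknown excluded. The helper name `sector_share()` is introduced here for illustration only.

```r
# Counts copied from the sector tables above (Unknown excluded).
pkg_counts   <- c(Academic = 6240, Business = 583, Government = 166, Nonprofit = 142)
maint_counts <- c(Academic = 3639, Business = 196, Government = 87, Nonprofit = 92)

# Illustrative helper: share of each sector among records with a known sector, in percent.
sector_share <- function(counts) round(100 * counts / sum(counts), 1)

pkg_share   <- sector_share(pkg_counts)
maint_share <- sector_share(maint_counts)

# One row per unit of analysis, one column per sector.
print(rbind(packages = pkg_share, maintainers = maint_share))
```

The denominators are 7,131 packages and 4,014 maintainers, not the full 19,852 and 10,821, which is why the shares sum to 100% within each row.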

Code
# Calculate counts by sector (All packages)
cran_sector_counts <- cran %>%
  filter(Sector != "Unknown") %>%
  count(Sector) %>%
  mutate(proportion = n / sum(n),
         proportion_label = paste0(round(proportion * 100, 1), "%")) 

# Save plot
cran_sector_counts_plot <- ggplot(cran_sector_counts, aes(x = Sector, y = n)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = proportion_label), vjust = -0.3) +
  ylab("Count of Packages") +
  ylim(c(0, 7000))+
  ggtitle(label = "Sector Distribution of All R packages")+ 
  labs(caption = "*64% Unknown for packages (removed from analysis)")+
  theme_clean()

cran_sector_counts_plot

# Calculate counts by sector (For unique Maintainers)
cran_sector_counts_unique <- cran %>%
  distinct(email, .keep_all = T)%>%
  filter(Sector != "Unknown") %>%
  count(Sector) %>%
  mutate(proportion = n / sum(n),
         proportion_label = paste0(round(proportion * 100, 1), "%")) 

# Save plot
cran_sector_counts_unique_plot <-ggplot(cran_sector_counts_unique, aes(x = Sector, y = n)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = proportion_label), vjust = -0.3) +
  ylab("Count of Maintainers") +
  ylim(c(0, 7000))+
  ggtitle(label = "Sector Distribution of All Unique R Package Maintainers")+ 
  labs(caption = "*63% Unknown for unique maintainers (removed from analysis)")+
  theme_clean()

cran_sector_counts_unique_plot

3.1.1.2 What are the top 10 institutions developing R Packages based on the number of packages and number of unique maintainers?

Based on all packages, the institution identified most often from maintainer email domains is RStudio, followed by Harvard University. Based on unique maintainer emails, however, Harvard University is the most frequently identified institution, followed by NetEase, and RStudio drops to fifth. This suggests that a small number of RStudio maintainers account for many of its packages.
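The difference between the two rankings comes from deduplicating on maintainer email before counting. A minimal base R sketch on made-up data (the institutions and emails below are invented) shows how the same rows can produce different counts:

```r
# Toy data (made up): one row per package, with maintainer email and institution.
toy <- data.frame(
  Package     = c("p1", "p2", "p3", "p4", "p5", "p6"),
  email       = c("a@x.com", "a@x.com", "a@x.com", "b@y.edu", "c@y.edu", "d@y.edu"),
  Institution = c("X Corp", "X Corp", "X Corp",
                  "Y University", "Y University", "Y University"),
  stringsAsFactors = FALSE
)

# Ranking by packages: every row counts.
by_package <- table(toy$Institution)

# Ranking by unique maintainers: keep the first row per email, then count.
toy_unique    <- toy[!duplicated(toy$email), ]
by_maintainer <- table(toy_unique$Institution)

print(by_package)
print(by_maintainer)
```

Both institutions have three packages, but the first has a single maintainer, so it falls behind once each email is counted once.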

Code
### sorting to the top 10 most common institutions for packages
top10_Institutions <- sort(table(cran$Institution), decreasing = T)
top10_Institutions <- as.data.frame(head(top10_Institutions, 10))

colnames(top10_Institutions) <- c("Institution", "Freq")

### joining to institution dataframe to get sector variable
top10_Institutions <- cran %>% 
  right_join(top10_Institutions, by = "Institution")%>%
  distinct(Institution, .keep_all = T)%>%
  select(Institution, Sector, Freq)%>%
  arrange(desc(Freq))

### sorting to the top 10 most common institutions for distinct maintainers
top10_Institutions_unique <- sort(table(cran_unique$Institution), decreasing = T)
top10_Institutions_unique <- as.data.frame(head(top10_Institutions_unique, 10))

colnames(top10_Institutions_unique) <- c("Institution", "Freq")

### joining to institution unique dataframe to get sector variable
top10_Institutions_unique <- cran %>% 
  right_join(top10_Institutions_unique, by = "Institution")%>%
  distinct(Institution, .keep_all = T)%>%
  select(Institution, Sector, Freq)%>%
  arrange(desc(Freq))

### Graph output of top 10 institutions for packages
 ggplot(top10_Institutions, aes(x = reorder(Institution, Freq), y = Freq, fill = Sector))+
   geom_bar(stat = "identity") +
    coord_flip() +
    # set limits here: a separate ylim() would replace this scale and drop `expand`
    scale_y_continuous(expand = c(0,0), limits = c(0, 350)) +
    labs(x = "", y = "Number of Packages",
         title = "Top 10 Institutions for All R Packages" ) +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme_clean()


 
 ### Graph output of top 10 institutions for unique maintainers
 ggplot(top10_Institutions_unique, aes(x = reorder(Institution, Freq), y = Freq, fill = Sector))+
   geom_bar(stat = "identity") +
    coord_flip() +
    # set limits here: a separate ylim() would replace this scale and drop `expand`
    scale_y_continuous(expand = c(0,0), limits = c(0, 350)) +
    labs(x = "", y = "Number of Maintainers",
         title = "Top 10 Institutions for Unique Maintainers" ) +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme_clean()

Code
### Table output of top 10 Institutions for packages
top10_Institutions %>%
  kbl(caption = "Most Frequent Institutions for Packages", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Institutions for Packages
Institution                                      Sector       Freq
RStudio                                          Business      329
Harvard University                               Academic      146
University of California-Berkeley                Academic      103
NetEase                                          Business       98
University of Washington-Seattle Campus          Academic       91
University of Michigan-Ann Arbor                 Academic       91
University of Minnesota-Twin Cities              Academic       84
Stanford University                              Academic       84
University of Wisconsin-Madison                  Academic       79
University of Auckland                           Academic       78

Code
### Table output of top 10 Institutions for unique maintainers
top10_Institutions_unique %>%
  kbl(caption = "Most Frequent Institutions for Unique Maintainers", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Institutions for Unique Maintainers
Institution                                      Sector       Freq
Harvard University                               Academic       71
NetEase                                          Business       57
University of Washington-Seattle Campus          Academic       55
University of Michigan-Ann Arbor                 Academic       54
RStudio                                          Business       49
University of Minnesota-Twin Cities              Academic       46
University of California-Berkeley                Academic       39
Stanford University                              Academic       36
University of Wisconsin-Madison                  Academic       34
Google                                           Business       32

3.1.2 GitHub R Packages

As stated in the introduction, we also collected data from GitHub for every R package that we could link to a repository. GitHub provides additional data, including repository statistics and contributor-level data, where a contributor is any individual who collaborates on a given repository. This lets us compare distributions at both the maintainer and contributor levels. For now, we stay at the package level, that is, the maintainer-level information of each package.

After linking to GitHub, we were able to identify repository data for 7,844 of the 19,852 packages on CRAN.

We first have to extract the slug (owner/repository) from all packages that have a GitHub URL.

Code
#### filtering for URLs that only contain github.com in the link
cran_github <- cran %>% filter(grepl("https://github.com", URL, ignore.case = TRUE))

### extracting the URL portion with the slug
cran_github <- cran_github %>%
          mutate(URL = str_extract(URL, "https://github.com/([^/]+)/([^/]+)"))


### removing commas
cran_github <- cran_github %>%
          mutate(URL = sub(",.*$",  "",  URL))

### extracting slug portion
cran_github <- cran_github %>%
          mutate(slug = str_extract(URL, "(?<=github.com/)[^/]+/[^/]+"))


cran_github <- cran_github %>%
          mutate(slug = str_extract(slug, "[^\\s]+/[^\\s]+"))
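The extraction steps above can be tried on a few made-up URLs. The sketch below is a base R approximation, not the exact pipeline: the helper `extract_slug()` and the example URLs are assumptions for illustration, and the pattern stops the repository name at a slash, whitespace, comma, or `#`.

```r
# Hypothetical helper: pull "owner/repo" out of a URL field that mentions github.com.
extract_slug <- function(url) {
  # Lookbehind for "github.com/", then owner and repo separated by one slash.
  pattern <- "(?<=github\\.com/)[^/\\s,#]+/[^/\\s,#]+"
  m   <- regexpr(pattern, url, perl = TRUE, ignore.case = TRUE)
  hit <- !is.na(m) & m > -1
  out <- rep(NA_character_, length(url))
  out[hit] <- regmatches(url, m)   # regmatches() returns matched elements only
  out
}

# Made-up URL fields in the shapes seen on CRAN: single URL, comma-separated list,
# deep link into a repository, and a non-GitHub site.
urls <- c(
  "https://github.com/owner-a/pkgA",
  "https://github.com/owner-b/pkgB, https://owner-b.example.org",
  "https://github.com/owner-c/pkgC/issues",
  "https://example.org/pkgD"
)
slugs <- extract_slug(urls)
print(slugs)
```

Rows that come back as `NA` have no usable GitHub slug and would not match anything in the later join.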

We can now join the GitHub-linked subset of the cran dataframe to the repository data we collected.

Code
### creating slug for linkage
cran_repos <- cran_repos %>%
  mutate(slug = paste(owner, repo, sep = "/"))

### join the GitHub-linked CRAN packages to the repository data by slug
cran_repos <- cran_github %>%
                  left_join(cran_repos, by = "slug")%>%
                  distinct(slug, .keep_all = T)


### create "year_created" variable
cran_repos$year_created <- substr(cran_repos$created_at, 1, 4)

3.1.2.1 What is the distribution of GitHub R Packages and GitHub R Package maintainers by sector?

Out of 7,844 packages identified on GitHub, we were able to identify a sector for 2,379. Of these, 1,858 were identified as academic, 385 as business, 70 as government, and 66 as nonprofit.

Code
pander(table(cran_repos$Sector, useNA = "always"))
Academic Business Government Nonprofit Unknown NA
1858 385 70 66 5465 0

Out of 4,267 unique maintainers identified on GitHub, we were able to identify a sector for 1,322: 1,132 were identified as academic, 109 as business, 39 as government, and 42 as nonprofit.

Code
## sectors based on unique maintainers
cran_repos_unique <- cran_repos %>%
                    distinct(email, .keep_all = T)

pander(table(cran_repos_unique$Sector, useNA = "always"))
Academic Business Government Nonprofit Unknown NA
1132 109 39 42 2945 0

Among the GitHub R packages for which we were able to identify a sector, 78% are academic, 16% are business, 3% are government, and 3% are nonprofit. Among unique maintainers with an identified sector, 86% are academic, 8% are business, 3% are government, and 3% are nonprofit.

Code
# Calculate counts by sector (All packages on GitHub)
cran_repo_sector_counts <- cran_repos %>%
  filter(Sector != "Unknown") %>%
  count(Sector) %>%
  mutate(proportion = n / sum(n),
         proportion_label = paste0(round(proportion * 100, 1), "%")) 

# Save plot
cran_repo_sector_counts_plot <- ggplot(cran_repo_sector_counts, aes(x = Sector, y = n)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = proportion_label), vjust = -0.3) +
  ylab("Count of Packages") +
  ylim(c(0, 2000))+
  ggtitle(label = "Number of R Packages on GitHub by Maintainer's Sector")+ 
  labs(caption = "*70% Unknown for packages (removed from analysis)")+
  theme_clean()

cran_repo_sector_counts_plot

# Calculate counts by sector (For unique Maintainers on GitHub)
cran_repo_sector_counts_unique <- cran_repos_unique %>%
  distinct(email, .keep_all = T)%>%
  filter(Sector != "Unknown") %>%
  count(Sector) %>%
  mutate(proportion = n / sum(n),
         proportion_label = paste0(round(proportion * 100, 1), "%")) 

# Save plot
cran_repo_sector_counts_unique_plot <-ggplot(cran_repo_sector_counts_unique, aes(x = Sector, y = n)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = proportion_label), vjust = -0.3) +
  ylab("Count of Maintainers") +
  ylim(c(0, 2000))+
  ggtitle(label = "Number of R Package Maintainers on GitHub by Sector")+
  labs(caption = "*69% Unknown for unique maintainers (removed from analysis)")+
  theme_clean()

cran_repo_sector_counts_unique_plot

3.1.2.2 What are the top 10 institutions developing R Packages on GitHub based on the number of packages and number of unique maintainers?

By number of packages, RStudio develops the most R packages on GitHub by a wide margin. When we count unique maintainers only, the gap between RStudio and the other institutions becomes much smaller, which suggests that a few maintainers develop many of its packages. Note also that packages without a sector label have no institution label either, since the two are identified together.
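One rough way to quantify this concentration is the ratio of packages to unique maintainers per institution. The sketch below uses figures copied from the two GitHub tables in this section for the institutions that appear in both; the column `packages_per_maintainer` is introduced here for illustration and is only an approximation, since the unique-maintainer table keeps one row per email.

```r
# Figures copied from the "Packages on GitHub" and "Unique Maintainers on GitHub" tables.
github_top <- data.frame(
  Institution = c("RStudio", "Harvard University",
                  "University of California-Berkeley", "NetEase"),
  packages    = c(278, 51, 53, 38),
  maintainers = c(45, 24, 14, 22),
  stringsAsFactors = FALSE
)

# Illustrative ratio: average number of packages per unique maintainer.
github_top$packages_per_maintainer <-
  round(github_top$packages / github_top$maintainers, 1)

# Sort so the most concentrated institution comes first.
github_top <- github_top[order(-github_top$packages_per_maintainer), ]
print(github_top)
```

A higher ratio means fewer people account for an institution's package count.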

Code
### sorting to the top 10 most common institutions for packages
top10_Institutions_GitHub <- sort(table(cran_repos$Institution), decreasing = T)
top10_Institutions_GitHub <- as.data.frame(head(top10_Institutions_GitHub, 10))

colnames(top10_Institutions_GitHub) <- c("Institution", "Freq")

### joining to institution dataframe to get sector variable
top10_Institutions_GitHub <- cran_repos %>% 
  right_join(top10_Institutions_GitHub, by = "Institution")%>%
  distinct(Institution, .keep_all = T)%>%
  select(Institution, Sector, Freq)%>%
  arrange(desc(Freq))

### sorting to the top 10 most common institutions for distinct maintainers
top10_Institutions_GitHub_unique <- sort(table(cran_repos_unique$Institution), decreasing = T)
top10_Institutions_GitHub_unique <- as.data.frame(head(top10_Institutions_GitHub_unique, 10))

colnames(top10_Institutions_GitHub_unique) <- c("Institution", "Freq")

### joining to institution unique dataframe to get sector variable
top10_Institutions_GitHub_unique <- cran_repos_unique %>% 
  right_join(top10_Institutions_GitHub_unique, by = "Institution")%>%
  distinct(Institution, .keep_all = T)%>%
  select(Institution, Sector, Freq)%>%
  arrange(desc(Freq))

### Graph output of top 10 institutions for packages
 ggplot(top10_Institutions_GitHub, aes(x = reorder(Institution, Freq), y = Freq, fill = Sector))+
   geom_bar(stat = "identity") +
    coord_flip() +
    # set limits here: a separate ylim() would replace this scale and drop `expand`
    scale_y_continuous(expand = c(0,0), limits = c(0, 300)) +
    labs(x = "", y = "Number of Packages",
         title = "Top 10 Institutions for R Packages on GitHub" ) +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme_clean()+
   theme(
  plot.title = element_text(size = 13))


 
 ### Graph output of top 10 institutions for unique maintainers
 ggplot(top10_Institutions_GitHub_unique, aes(x = reorder(Institution, Freq), y = Freq, fill = Sector))+
   geom_bar(stat = "identity") +
    coord_flip() +
    # set limits here: a separate ylim() would replace this scale and drop `expand`
    scale_y_continuous(expand = c(0,0), limits = c(0, 300)) +
    labs(x = "", y = "Number of Maintainers",
         title = "Top 10 Institutions for Unique Maintainers on GitHub" ) +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme_clean()+
   theme(
  plot.title = element_text(size = 13))

Code
### Table output of top 10 Institutions for packages
top10_Institutions_GitHub %>%
  kbl(caption = "Most Frequent Institutions for Packages on GitHub", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Institutions for Packages on GitHub
Institution                                      Sector       Freq
RStudio                                          Business      278
University of California-Berkeley                Academic       53
Harvard University                               Academic       51
NetEase                                          Business       38
University of Wisconsin-Madison                  Academic       36
University of Oslo                               Academic       35
University of Michigan-Ann Arbor                 Academic       33
University College London                        Academic       32
University of Alberta                            Academic       28
French National Centre for Scientific Research   Government     25

Code
### Table output of top 10 Institutions for unique maintainers
top10_Institutions_GitHub_unique %>%
  kbl(caption = "Most Frequent Institutions for Unique Maintainers on GitHub", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Institutions for Unique Maintainers on GitHub
Institution                                      Sector       Freq
RStudio                                          Business       45
Harvard University                               Academic       24
University of Michigan-Ann Arbor                 Academic       22
NetEase                                          Business       22
University of Washington-Seattle Campus          Academic       19
University of California-Berkeley                Academic       14
University College London                        Academic       14
University of Wisconsin-Madison                  Academic       14
Copenhagen University                            Academic       14
University of Oslo                               Academic       12

3.1.2.3 How are these distributions changing over time?

We can identify the year a repository was created from the date and time it was created on GitHub, one of the variables we collected during GitHub data collection.
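As a small illustration, assuming `created_at` is stored as an ISO 8601 string of the kind the GitHub API returns, the year is simply the first four characters. The timestamps below are made up, and the parsed variant is an assumed alternative rather than what the analysis uses.

```r
# Made-up timestamps in the ISO 8601 style GitHub uses for created_at.
created_at <- c("2014-06-01T09:30:00Z", "2020-11-15T17:05:42Z", NA)

# Same approach as the analysis: the first four characters are the year.
year_created <- substr(created_at, 1, 4)

# Assumed alternative: parse the timestamp first, so malformed strings become NA
# instead of yielding four arbitrary characters.
parsed      <- as.POSIXct(created_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
year_parsed <- format(parsed, "%Y")

print(year_created)
print(year_parsed)
```

Either way the result is a character vector, so missing years are true `NA` values and are best filtered with `is.na()`.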

We can now see how the sector distribution changes over time and identify the years in which we were able to identify the most sectors. We run the same analysis twice: once for all R packages on GitHub and once for all unique R maintainers on GitHub.

The number of packages and maintainers with an identifiable sector generally increases from year to year until 2020, when there is a dip in the number of packages and maintainers being registered on GitHub. The sector distribution stays essentially the same from year to year in both plots: academic makes up the majority, with slight fluctuations in the other sectors.
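Because the stacked bars show counts, stability of the sector mix is easier to judge from within-year shares. A minimal sketch on made-up rows, using base R `table()` and `prop.table()`, is shown here; the data frame `toy` is invented and only mimics the `year_created` and `Sector` columns.

```r
# Toy data (made up): one row per package with its creation year and sector.
toy <- data.frame(
  year_created = c("2018", "2018", "2018", "2018", "2019", "2019", "2019", "2019"),
  Sector = c("Academic", "Academic", "Academic", "Business",
             "Academic", "Academic", "Business", "Government"),
  stringsAsFactors = FALSE
)

# Counts per year and sector, then row-wise shares so each year sums to 100%.
counts <- table(toy$year_created, toy$Sector)
shares <- round(100 * prop.table(counts, margin = 1), 1)

print(counts)
print(shares)
```

Comparing rows of `shares` across years shows whether the mix is shifting even when the yearly totals differ.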

Code
cran_repos_time <- cran_repos %>%
  filter(Sector != "Unknown", !is.na(year_created)) %>%
ggplot(aes(x = as.factor(year_created), fill = Sector)) +
  geom_bar() +
  labs(
    x = "Year",
    y = "Number of Packages",
    title = "Change in Sectors Over Time for R Packages on GitHub"
  ) +
  theme_clean()+
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+
  ylim(c(0, 300))

cran_repos_time


cran_repos_unique_time <- cran_repos_unique %>%
  filter(Sector != "Unknown", !is.na(year_created)) %>%
ggplot(aes(x = as.factor(year_created), fill = Sector)) +
  geom_bar() +
  labs(
    x = "Year",
    y = "Number of Maintainers",
    title = "Change in Sectors Over Time for Unique R Maintainers on GitHub"
  ) +
  theme_clean()+
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))+
  ylim(c(0, 300))


cran_repos_unique_time

3.1.3 GitHub R Package Contributors

Now we look at distributions of all R contributors on GitHub. After GitHub data collection, we were able to identify 14,328 unique R contributors.

Code
cran_users_unique <- cran_users %>%
                        distinct(login, .keep_all = T)

nrow(cran_users_unique)
[1] 14328

We also collected commit data for each of the unique R contributors. We join this back with our unique R contributors dataframe to combine commit, sector, country, and organization variables.

Code
### summing total additions for each user within each repository
user_commits_total <- user_commits %>%
                  group_by(slug, login) %>%
                  summarise(total_additions = sum(additions)) %>%
                  ungroup()

### join back to unique users dataframe for other variables
user_commits_total <- user_commits_total %>%
                        left_join(cran_users_unique, by = "login") %>%
                          select(slug, login,name, email, total_additions, organization, sector, country)
cran_repos2 <- cran_repos %>%
                select(slug, year_created, stargazer_count, fork_count, Downloads_All_Time, Downloads_Normalized, Reverse_Depends_Count)

user_commits_total <- user_commits_total %>%
                        left_join(cran_repos2, by = "slug")

### Rename NA sectors to Unknown 
user_commits_total <- user_commits_total %>%
  mutate(sector = ifelse(is.na(sector) | sector == "Unknown", "Unknown", sector))

3.1.3.1 What is the distribution of unique GitHub R contributors by sector?

Of the 14,328 unique R contributors on GitHub, we were able to identify a sector for 2,573: 1,870 from academic, 482 from business, 84 from government, and 137 from nonprofit.

Code
pander(table(cran_users_unique$sector, useNA = "always"))
Academic Business Government Nonprofit Unknown NA
1870 482 84 137 11755 0

Among unique R developers (contributors to a slug) on GitHub with an identified sector, 73% are academic, 19% business, 5% nonprofit, and 3% government.

Code
# Calculate counts by sector (unique contributors on GitHub)
cran_user_sector_counts <- cran_users_unique %>%
  filter(!is.na(sector), sector != "Unknown") %>%
  count(sector) %>%
  mutate(proportion = n / sum(n),
         proportion_label = paste0(round(proportion * 100, 1), "%")) %>%
  arrange(desc(proportion)) %>%
  mutate(sector = factor(sector, levels = unique(sector)))

# Save plot
cran_user_sector_counts_plot <- ggplot(cran_user_sector_counts, aes(x = sector, y = n)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = proportion_label), vjust = -0.3) +
  ylab("Count of Developers") +
  ylim(c(0, 2000))+
  ggtitle(label = "Number of R Package Developers on GitHub by Sector")+ 
  labs(caption = "*Developers without sector information are removed in this figure (82% of 14,328 R Developers)")+
  theme_clean()

cran_user_sector_counts_plot

3.1.3.1.1 How do we attribute contribution to sectors (equal)?

We now attribute contribution to sectors using two methods. The first is equal contribution, where each member of a repository receives an equal fraction of the credit regardless of how much they contributed. For example, if a repository has five members, each member gets 0.2 credit, and these fractions are then aggregated to sectors. We count the fraction going to unknown sectors as well, but remove it from graphical displays, since we already know it will be the largest share.

Note: This differs from the unique-user distribution, because a user who is a member of multiple repositories is counted once for each repository.
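The equal-credit rule can be worked through on a tiny made-up example in base R. The data frame `toy` and its logins are invented; the column names mirror `slug`, `login`, and `sector` from the analysis.

```r
# Toy data (made up): one row per (repository, contributor) pair.
toy <- data.frame(
  slug   = c("a/x", "a/x", "b/y", "b/y", "b/y", "b/y"),
  login  = c("u1", "u2", "u1", "u3", "u4", "u5"),
  sector = c("Academic", "Business", "Academic", "Academic", "Government", "Unknown"),
  stringsAsFactors = FALSE
)

# Each contributor gets 1 / (number of contributors in that repository).
n_in_repo    <- ave(seq_along(toy$slug), toy$slug, FUN = length)
toy$fraction <- 1 / n_in_repo

# Aggregate the fractions to sectors; the total equals the number of repositories.
credit <- tapply(toy$fraction, toy$sector, sum)

# Percentages with Unknown included, and re-normalised with Unknown excluded.
pct_all   <- round(100 * credit / sum(credit), 1)
known     <- credit[names(credit) != "Unknown"]
pct_known <- round(100 * known / sum(known), 1)

print(credit)
print(pct_all)
print(pct_known)
```

Here `u1` contributes to both repositories and so earns credit twice (0.5 and 0.25), which is the repeat counting described in the note above.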

Code
# 1. Count the number of unique login per slug.
login_counts <- user_commits_total %>%
  group_by(slug) %>%
  summarise(num_logins = n_distinct(login))

# 2. Compute the contribution fraction for each login.
user_commits_total <- user_commits_total %>%
  left_join(login_counts, by = "slug") %>%
  mutate(contribution_fraction_equal = 1 / num_logins) %>%
  select(-num_logins)  # Removing the num_logins column as it's no longer needed

# 3. Sum the contribution fraction for each sector per slug.
sector_contribution <- user_commits_total %>%
  group_by(slug, sector) %>%
  summarise(total_contribution_fraction = sum(contribution_fraction_equal))

# 4. Aggregate the contribution fraction for each sector across all slugs.
sector_aggregated <- sector_contribution %>%
  group_by(sector) %>%
  summarise(overall_contribution_fraction = sum(total_contribution_fraction))


# Calculate the total overall contribution fraction over all sectors
total_overall_contribution = sum(sector_aggregated$overall_contribution_fraction)

# Calculate the percentage contribution for each sector
sector_aggregated = sector_aggregated %>%
  mutate(percentage_contribution = round((overall_contribution_fraction / total_overall_contribution) * 100, 1))

### Build percentage labels for plotting
sector_aggregated$percentage_label <- scales::percent(sector_aggregated$percentage_contribution / 100)

Based on equal contribution of each unique login to each unique repository, we would attribute 80% of credit to the academic sector, 15% to business, 2% to government, and 3% to nonprofit. These percentages are computed over the credit with a known sector only: the Unknown category, which would otherwise receive 78% of the credit, is excluded.
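To relate the two sets of percentages, the known-sector shares can be scaled back by the fraction of credit that is not Unknown. The sketch below uses the rounded figures reported in this section (80/15/2/3 and an Unknown share of 77.7%), so the results are approximations rather than values produced by the analysis.

```r
# Rounded shares among known sectors, as reported in the text (percent).
known_share <- c(Academic = 80, Business = 15, Government = 2, Nonprofit = 3)

# Share of all credit that goes to Unknown, as reported in the plot caption (percent).
unknown_pct <- 77.7

# Scale each known-sector share by the non-Unknown portion of total credit.
overall_share <- known_share * (100 - unknown_pct) / 100

print(round(overall_share, 1))
print(sum(overall_share) + unknown_pct)
```

The known sectors together account for the remaining 22.3% of credit, so even the academic sector holds less than a fifth of the total once Unknown is included.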

Code
### Excluding the unknown percentage in the table
total_excluding_unknown <- sum(sector_aggregated$overall_contribution_fraction[sector_aggregated$sector != "Unknown"])

### recalculating what percentages would be without unknown 
sector_aggregated <- sector_aggregated %>%
  mutate(percentage_contribution_excl_unknown = ifelse(sector != "Unknown", 
                                                       round((overall_contribution_fraction / total_excluding_unknown) * 100, 1), NA_real_))

### making labels
sector_aggregated$percentage_label_excl_unknown <- scales::percent(sector_aggregated$percentage_contribution_excl_unknown / 100, accuracy = 0.1)

ggplot(sector_aggregated %>% filter(sector != "Unknown"), aes(x = sector, y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = -0.5, size = 4) +
  geom_text(aes(label = paste0("(", round(overall_contribution_fraction, 2), ")")), position = position_dodge(width = 0.9), vjust = -2.5)+
  labs(title = "Percentage Contribution by Sector (Equal)",
       x = "Sector",
       y = "Percentage Contribution") +
  theme_clean() +
  labs(caption = "*Excludes the percentage contribution from unknown sector (77.7%)")+
  ylim(0,100)

3.1.3.1.2 How do we attribute contribution to sectors based on lines of code?

We can also attribute contribution to sectors based on the lines of code each user added to a given repository. The more lines a user added to a repository, the more credit that user receives. For example, if a repository has 500 total lines of added code and one user wrote 300 of them, that user gets 0.6 of the credit. We then apply the same fractional counting method to aggregate this credit to sectors.
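The 300-out-of-500 example can be reproduced on made-up data in base R. The data frame `toy` is invented; `total_additions` mirrors the column of the same name in the analysis.

```r
# Toy data (made up): lines added per contributor per repository.
toy <- data.frame(
  slug            = c("a/x", "a/x", "a/x", "b/y", "b/y"),
  login           = c("u1", "u2", "u3", "u1", "u4"),
  sector          = c("Academic", "Business", "Unknown", "Academic", "Government"),
  total_additions = c(300, 150, 50, 90, 10),
  stringsAsFactors = FALSE
)

# Total additions per repository, repeated on every row of that repository.
toy$repo_total <- ave(toy$total_additions, toy$slug, FUN = sum)

# Credit is the contributor's share of the repository's added lines.
toy$fraction <- toy$total_additions / toy$repo_total

# Aggregate to sectors, then re-normalise with Unknown excluded.
credit    <- tapply(toy$fraction, toy$sector, sum)
known     <- credit[names(credit) != "Unknown"]
pct_known <- round(100 * known / sum(known), 1)

print(toy[, c("slug", "login", "fraction")])
print(credit)
print(pct_known)
```

A repository whose additions sum to zero would give 0/0 and therefore `NaN` fractions, so such rows need to be dropped or handled before aggregating.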

Code
# Calculate the total code additions for each slug (project/repository identifier)
# Grouping by the slug, and then summarizing the total additions for each slug.
slug_totals <- user_commits_total %>%
  group_by(slug) %>%
  summarise(total_code_for_slug = sum(total_additions))

# Compute the contribution fraction for each user.
# This is done by joining the user's total additions with the total code additions for their respective slug,
# and then computing the user's contribution as a fraction of the slug's total.
user_commits_total <- user_commits_total %>%
  left_join(slug_totals, by = "slug") %>%
  mutate(contribution_fraction_loc = total_additions / total_code_for_slug)

# Compute the total contribution fraction for each combination of slug and sector.
# This groups the data by slug and sector, and then sums up the contribution fractions.
sector_addition_contribution <- user_commits_total %>%
  group_by(slug, sector) %>%
  summarise(total_addition_contribution = sum(contribution_fraction_loc))

# Aggregate the contributions at the sector level.
# This groups by the sector and then computes the overall contribution fraction for each sector.
sector_aggregated_additions <- sector_addition_contribution %>%
  group_by(sector) %>%
  summarise(overall_addition_contribution = sum(total_addition_contribution, na.rm = TRUE))

# Compute the total overall additions across all sectors.
total_overall_additions = sum(sector_aggregated_additions$overall_addition_contribution)

# Calculate the percentage of additions for each sector relative to the total overall additions.
sector_aggregated_additions$percentage_additions = round((sector_aggregated_additions$overall_addition_contribution / total_overall_additions) * 100,1)

# Create a label for the percentage values, turning the decimal fraction into a percentage string (e.g., 0.5 becomes "50%").
sector_aggregated_additions$percentage_label_additions = scales::percent(sector_aggregated_additions$percentage_additions / 100)

With this weighting, 83% of the known credit is attributed to the academic sector, 13% to business, 2% to government, and 2% to nonprofit. The share attributed to Unknown decreases from 77.7% under equal weighting to 75.6%.

Code
# Calculate the total code additions while excluding the 'Unknown' sector.
total_excluding_unknown_add <- sum(sector_aggregated_additions$overall_addition_contribution[sector_aggregated_additions$sector != "Unknown"])

# Compute the percentage contribution for each sector relative to the total (excluding 'Unknown' sector).
# If the sector is 'Unknown', set the percentage as NA.
sector_aggregated_additions <- sector_aggregated_additions %>%
  mutate(percentage_contribution_excl_unknown = ifelse(sector != "Unknown", 
                                                       round((overall_addition_contribution / total_excluding_unknown_add) * 100, 1), NA_real_))

# Create a label for the percentage values that excludes 'Unknown' sector, turning the decimal fraction into a percentage string.
 sector_aggregated_additions$percentage_label_excl_unknown <- scales::percent(sector_aggregated_additions$percentage_contribution_excl_unknown / 100, accuracy = 0.1)

# Visualize data
ggplot(sector_aggregated_additions %>% filter(sector != "Unknown"), aes(x = sector, y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = -0.5, size = 4) + # Adjust vjust and size as needed
  geom_text(aes(label = paste0("(", round(overall_addition_contribution, 2), ")")), position = position_dodge(width = 0.9), vjust = -2.5)+
  labs(#title = "Percentage Contribution by Sector (Weighted)",
       x = "Sector",
       y = "Percentage Contribution") +
  theme_clean() +
  #labs(caption = "*Excludes the percentage contribution from unknown sector (75.6%)")+
  ylim(0,100)+
  theme(axis.text = element_text(size = 14),
         axis.title = element_text(size = 12))

3.1.3.1.3 Attributing credit to Sectors over time

Code
# Compute the total contribution fraction for each combination of slug, sector, and year
sector_addition_contribution_time <- user_commits_total %>%
  group_by(slug, sector, year_created) %>%
  summarise(total_addition_contribution = sum(contribution_fraction_loc), .groups = 'drop')

# Aggregate the contributions at the sector and year level
sector_aggregated_additions_time <- sector_addition_contribution_time %>%
  group_by(sector, year_created) %>%
  summarise(overall_addition_contribution = sum(total_addition_contribution, na.rm = TRUE), .groups = 'drop')

# Compute the total overall additions across all sectors by year
total_overall_additions_by_year <- sector_aggregated_additions_time %>%
  group_by(year_created) %>%
  summarise(yearly_total = sum(overall_addition_contribution), .groups = 'drop')

# Calculate the percentage of additions for each sector relative to the total overall additions for each year
sector_aggregated_additions_time <- sector_aggregated_additions_time %>%
  left_join(total_overall_additions_by_year, by = "year_created") %>%
  mutate(percentage_additions = (overall_addition_contribution / yearly_total) * 100)

# Calculate the total code additions for each year while excluding the 'Unknown' sector
total_excluding_unknown_by_year <- sector_aggregated_additions_time %>%
  filter(sector != "Unknown") %>%
  group_by(year_created) %>%
  summarise(yearly_total_excl_unknown = sum(overall_addition_contribution), .groups = 'drop')

# Compute the percentage contribution for each sector by year relative to the year's total excluding 'Unknown'
sector_aggregated_additions_time <- sector_aggregated_additions_time %>%
  left_join(total_excluding_unknown_by_year, by = "year_created") %>%
  mutate(percentage_contribution_excl_unknown = ifelse(sector != "Unknown" & !is.na(yearly_total_excl_unknown), 
                                                       (overall_addition_contribution / yearly_total_excl_unknown) * 100, 
                                                       NA_real_))

# Round the percentages and create labels
sector_aggregated_additions_time$percentage_contribution_excl_unknown <- round(sector_aggregated_additions_time$percentage_contribution_excl_unknown, 1)
sector_aggregated_additions_time$percentage_label_excl_unknown <- ifelse(is.na(sector_aggregated_additions_time$percentage_contribution_excl_unknown),
                                                                    NA_character_,
                                                                    scales::percent(sector_aggregated_additions_time$percentage_contribution_excl_unknown / 100))  # scales is not attached, so call percent() with its namespace

# sector_aggregated_additions_time now holds the yearly percentages computed above
# Filter out the 'Unknown' sector, missing years, and 2023 for plotting
plot_data <- sector_aggregated_additions_time %>%
  filter(sector != "Unknown",
         year_created != "NA" & year_created != "2023")
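
To make the per-year renormalization concrete, here is a minimal base R sketch on made-up numbers. The data frame `credit` and its values are hypothetical; only the column names follow the code above. It shows how each known sector's share is computed against the yearly total that excludes Unknown.

```r
# Hypothetical sector-by-year credit totals (not real results).
credit <- data.frame(
  sector = c("Academic", "Business", "Unknown", "Academic", "Unknown"),
  year_created = c(2020, 2020, 2020, 2021, 2021),
  overall_addition_contribution = c(6, 2, 12, 5, 15)
)

known <- credit$sector != "Unknown"

# Per-year total over known sectors only (Unknown rows contribute 0).
yearly_known_total <- ave(
  ifelse(known, credit$overall_addition_contribution, 0),
  credit$year_created,
  FUN = sum
)

# Share of each known sector within its year; Unknown stays NA.
credit$percentage_contribution_excl_unknown <- ifelse(
  known,
  round(100 * credit$overall_addition_contribution / yearly_known_total, 1),
  NA_real_
)

credit
```

In this toy data the 2020 known total is 8, so Academic gets 75% and Business 25%, while the Unknown rows are left as NA and drop out of the plots.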

The following graph shows fractional credit for sectors over time.

Code
# Stacked Bar Chart for Yearly Totals
R_Sectors_time <- ggplot(plot_data, aes(x = year_created, y = overall_addition_contribution, fill = sector)) +
  geom_bar(stat = "identity") +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  labs(x = "", y = "Fractional Count of Packages", title = "Fractional Count of Packages for Sector by Year") + # Fractional Count of Packages for Sector by Year, y-axis: Fractional Count of Packages
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom")

R_Sectors_time

Code
# ggsave(filename = "\\\\westat.com\\dfs\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\New Graphs\\R_Sectors_time.png", plot = R_Sectors_time, width = 8, height = 6, dpi = 300)
Code
# Line Chart for Percentages by Sector (excluding 'Unknown')
ggplot(plot_data,
       aes(x = year_created, y = percentage_contribution_excl_unknown, color = sector, group = sector)) +
  geom_line() +
  geom_point() +
  labs(x = "", y = "Percentage of Total Packages", title = "Weighted Sector Contribution by Year") +
  scale_color_westat(option = "BLUES", drop = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")
Code
# Create the stacked bar plot
ggplot(plot_data, aes(x = year_created, y = percentage_contribution_excl_unknown, fill = sector)) +
  geom_bar(stat = "identity") +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "", y = "Percentage Contribution", title = "Weighted Sector Contribution by Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Adjust the angle of the x-axis labels for readability
        legend.position = "bottom")  # Place the legend at the bottom

3.1.3.2 What is the distribution of GitHub R contributors by country?

The diverstidy function, which we use to extract a country for each user, can return messy output, including multiple countries for a single user. We need to clean this up before analyzing country distributions. There were 427 unique users with multiple countries, so we manually reviewed each one and decided whether to keep all of the countries or delete some. The extracted country can be based on the email, location, company, or organization that a user has listed.

Here we replace NA values with “Unknown”.

Code
cran_users_unique <- cran_users_unique %>%
  mutate(
    country_fixed = strsplit(as.character(country), split = "\\|") %>%   # Split on "|"
      map(~unique(.)) %>%                                         # Keep only unique values
      sapply(paste, collapse = ",")                               # Collapse back into a string
  )

cran_users_unique <- cran_users_unique %>%
  mutate(country_fixed = ifelse(country_fixed == "NA", NA_character_, country_fixed))

cran_users_unique <- cran_users_unique %>%
  mutate(
    country_fixed = strsplit(country_fixed, split = ",") %>%  # Split on comma
      map(~ .[!. %in% "NA"]) %>%                             # Remove literal "NA" strings
      sapply(paste, collapse = ",")                           # Collapse back into a string
  )


cran_users_unique <- cran_users_unique %>%
                            left_join(user_countries, by = "login")

cran_users_unique <- cran_users_unique %>%
  mutate(country_final = ifelse(is.na(country_final), country_fixed, country_final))

cran_users_unique <- cran_users_unique %>%
  mutate(country_final = ifelse(is.na(country_final) | country_final == "NA", "Unknown", country_final))
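
The clean-up above can be illustrated on a few made-up values. This is a base R sketch, not the pipeline itself: the input vector and the helper `clean_country()` are hypothetical, and the pipe-delimited input format is assumed from the `strsplit()` call above.

```r
# Hypothetical raw values; "|" separates multiple extracted countries.
country <- c("United States|United States", "Germany|NA|France", "NA", NA)

# Hypothetical helper: de-duplicate, drop missing markers, fall back to "Unknown".
clean_country <- function(x) {
  parts <- unique(strsplit(x, "|", fixed = TRUE)[[1]])  # split and de-duplicate
  parts <- parts[!is.na(parts) & parts != "NA"]         # drop NA and literal "NA"
  if (length(parts) == 0) "Unknown" else paste(parts, collapse = ",")
}

country_fixed <- vapply(country, clean_country, character(1), USE.NAMES = FALSE)
country_fixed
```

Duplicates collapse to a single country, genuine multi-country users keep a comma-separated list, and users with no usable value end up as "Unknown".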

Among unique R users on GitHub, the United States is the most frequent country, followed by Germany and the United Kingdom. Out of 14,328 unique users, we were unable to find a country for 5,575.

Code
### sum of Unknowns for country

sum(cran_users_unique$country_final == "Unknown")
[1] 5575
Code
### sorting to the top 10 most common countries for distinct GitHub users
top10_Countries_GitHub_users_unique <- cran_users_unique %>%
  filter(country_final != "Unknown")

top10_Countries_GitHub_users_unique <-  sort(table(top10_Countries_GitHub_users_unique$country_final), decreasing = T)

top10_Countries_GitHub_users_unique  <- as.data.frame(head(top10_Countries_GitHub_users_unique , 10))

colnames(top10_Countries_GitHub_users_unique ) <- c("country_final", "Freq")

 ### Graph output of top 10 countries for unique maintainers
 ggplot(top10_Countries_GitHub_users_unique , aes(x = reorder(country_final, Freq), y = Freq))+
   geom_bar(stat = "identity", fill = westat_blue()) +
    coord_flip() +
    scale_y_continuous(expand = c(0,0), limits = c(0, 3000)) +  # single y scale; a separate ylim() would replace it and discard expand
    labs(x = "", y = "Number of GitHub Users",
         title = "Top 10 Countries for R Users on GitHub" ) +
  scale_fill_westat(option = "BLUES")+
  theme_clean()+
   theme(
  plot.title = element_text(size = 13))+
   labs(caption = "*Excludes count from unknown countries (5575)")

Code
### Table output of top 10 countries for unique GitHub users
top10_Countries_GitHub_users_unique %>%
  kbl(caption = "Most Frequent Countries for R Developers on GitHub", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Countries for R Developers on GitHub
country_final     Freq
United States     2809
Germany            854
United Kingdom     660
Canada             410
France             352
Australia          322
China              286
Netherlands        264
Switzerland        226
India              214

3.1.3.2.1 How do we attribute contribution to countries (equal)?

As noted earlier, some logins have multiple countries listed. For these logins, we split both the equal and the lines-of-code contribution fractions evenly among the countries. For example, if a user with two countries contributes to a slug with 4 unique users, each country receives 0.125 credit under equal contribution (0.25 / 2). For lines of code, if that user had 500 additions, each country is credited with 250 additions. After this step, we identify 123 unique countries.
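
The worked numbers in this paragraph can be reproduced with a small base R sketch. The data frame `commits` and its logins are hypothetical; the column names mirror those used in this section, and the splitting logic is an illustration of the idea rather than the exact function used below.

```r
# Hypothetical toy data: one slug with 4 contributors; "user_b" lists two countries.
commits <- data.frame(
  login = c("user_a", "user_b", "user_c", "user_d"),
  country_final = c("Germany", "United States,Canada", "France", "Unknown"),
  total_additions = c(100, 500, 300, 100),
  stringsAsFactors = FALSE
)
commits$contribution_fraction_equal <- 1 / nrow(commits)  # 0.25 each
commits$contribution_fraction_loc <- commits$total_additions / sum(commits$total_additions)

# Expand each row into one row per country.
countries <- strsplit(commits$country_final, ",\\s*")
n_countries <- lengths(countries)
idx <- rep(seq_len(nrow(commits)), n_countries)
split_commits <- commits[idx, ]
split_commits$country_final <- unlist(countries)

# Divide the credit columns evenly among the countries of each login.
num_cols <- c("total_additions", "contribution_fraction_equal", "contribution_fraction_loc")
split_commits[num_cols] <- split_commits[num_cols] / n_countries[idx]
rownames(split_commits) <- NULL

split_commits[split_commits$login == "user_b", ]
```

Here `user_b` yields two rows (United States and Canada), each with 250 additions and 0.125 equal credit, and the equal fractions across the slug still sum to 1.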

Code
# Function to handle the splitting and division for multiple countries
process_multiple_countries <- function(df) {
  num_countries <- length(str_split(df$country_final, ",\\s*")[[1]])
  df %>%
    separate_rows(country_final, sep = ",\\s*") %>%
    mutate(
      total_additions = total_additions / num_countries,
      contribution_fraction_equal = contribution_fraction_equal / num_countries,
      contribution_fraction_loc = contribution_fraction_loc / num_countries
    )
}

# join country variable back to commit table
user_countries <- cran_users_unique %>%
                    select(login, country_final)

user_commits_total <- user_commits_total %>%
                            left_join(user_countries, by = "login")

# Replace NA values in 'country_final' with 'Unknown'
user_commits_total$country_final[is.na(user_commits_total$country_final)] <- "Unknown"

# Process rows with multiple countries
multi_country_rows <- user_commits_total %>%
  filter(str_detect(country_final, ",")) %>%
  group_by(login) %>%
  do(process_multiple_countries(.))

# Exclude multi-country rows from the original df and bind the processed rows
user_commits_total <- user_commits_total %>%
  filter(!str_detect(country_final, ",")) %>%
  bind_rows(multi_country_rows)

Instead of grouping by sector, we have to group by country here.

Code
# Sum the contribution fraction for each country per slug.
country_contribution <- user_commits_total %>%
  group_by(slug, country_final) %>%
  summarise(total_contribution_fraction = sum(contribution_fraction_equal))

# Aggregate the contribution fraction for each country across all slugs
country_aggregated <- country_contribution %>%
  group_by(country_final) %>%
  summarise(overall_contribution_fraction = sum(total_contribution_fraction))

# Calculate the total overall contribution fraction over all countries
total_overall_contribution = sum(country_aggregated$overall_contribution_fraction)

# Calculate the percentage contribution for each country
country_aggregated = country_aggregated %>%
  mutate(percentage_contribution = round((overall_contribution_fraction / total_overall_contribution) * 100, 1))

### Create percentage labels for plotting
country_aggregated$percentage_label <- scales::percent(country_aggregated$percentage_contribution / 100)

If we give equal contributions to countries, the United States receives 31.1% of the credit, followed by Germany with 10.9%. These figures exclude the contribution attributed to Unknown (38.5%), so they are percentages of the known share (61.5%).
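
Because these percentages are relative to the known share, they can be converted back to a share of all credit by multiplying by 61.5%. The short sketch below does this arithmetic with the rounded figures quoted above, so the results are approximate; the variable names are illustrative only.

```r
# Rounded figures quoted in the text.
known_share <- 1 - 0.385                       # share of credit with a known country
share_of_known <- c("United States" = 0.311, "Germany" = 0.109)

# Share of all credit, including the part attributed to Unknown.
share_overall <- share_of_known * known_share
round(share_overall * 100, 1)
```

On these rounded inputs, the United States accounts for roughly 19.1% of all credit and Germany for roughly 6.7%.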

Code
total_excluding_unknown <- sum(country_aggregated$overall_contribution_fraction[country_aggregated$country_final != "Unknown"])

country_aggregated <- country_aggregated %>%
  mutate(percentage_contribution_excl_unknown = ifelse(country_final != "Unknown", 
                                                       round((overall_contribution_fraction / total_excluding_unknown) * 100, 1), NA_real_))


country_aggregated$percentage_label_excl_unknown <- scales::percent(country_aggregated$percentage_contribution_excl_unknown / 100, accuracy = 0.1)


top_10_countries <- country_aggregated %>%
  arrange(desc(percentage_contribution_excl_unknown)) %>%
  head(10)

ggplot(top_10_countries, aes(x = reorder(country_final, percentage_contribution_excl_unknown),  y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = .5, size = 4, hjust = -.25) +
  geom_text(aes(label = paste0("(", round(overall_contribution_fraction, 2), ")")), position = position_dodge(width = 0.9), vjust = .25, hjust = -1)+# Adjust vjust and size as needed
  labs(title = "Percentage Contribution by Country (Equal - Top 10 Countries)",
       x = "Country",
       y = "Percentage Contribution") +
  theme_clean() +
  ylim(0,100)+
  coord_flip()+
  theme(plot.title = element_text(size = 10))+
  labs(caption = "*Excludes the percentage contribution from unknown countries (38.5%)")

3.1.3.2.2 How do we attribute contribution to countries based on lines of code?

Now, we base the contribution on additions for country just as we did for sector.

Code
country_addition_contribution <- user_commits_total %>%
  group_by(slug, country_final) %>%
  summarise(total_addition_contribution = sum(contribution_fraction_loc))

country_aggregated_additions <- country_addition_contribution %>%
  group_by(country_final) %>%
  summarise(overall_addition_contribution = sum(total_addition_contribution, na.rm = TRUE))

total_overall_additions = sum(country_aggregated_additions$overall_addition_contribution)

country_aggregated_additions$percentage_additions = round((country_aggregated_additions$overall_addition_contribution / total_overall_additions) * 100,1)

country_aggregated_additions$percentage_label_additions = scales::percent(country_aggregated_additions$percentage_additions / 100)

Based on additions, the percentage attributed to Unknown decreases to 34.1%, so the known share increases to 65.9% overall. The United States is still at the top, but its share decreases slightly to 30.9%. The top 10 and their order stay the same, but the percentages increase slightly for the countries toward the bottom.

Code
total_excluding_unknown <- sum(country_aggregated_additions$overall_addition_contribution[country_aggregated_additions$country_final != "Unknown"])

country_aggregated_additions <- country_aggregated_additions %>%
  mutate(percentage_contribution_excl_unknown = ifelse(country_final != "Unknown", 
                                                       round((overall_addition_contribution / total_excluding_unknown) * 100, 1), NA_real_))


country_aggregated_additions$percentage_label_excl_unknown <- scales::percent(country_aggregated_additions$percentage_contribution_excl_unknown / 100, accuracy = 0.1)

top_10_countries_additions <- country_aggregated_additions %>%
  arrange(desc(percentage_contribution_excl_unknown)) %>%
  head(10)

 ggplot(top_10_countries_additions, aes(x = reorder(country_final, percentage_contribution_excl_unknown),  y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = .5, size = 6, hjust = -.12) + 
  geom_text(aes(label = paste0("(", round(overall_addition_contribution, 2), ")")), position = position_dodge(width = 0.9), vjust = .5, hjust = -1.1, size = 5)+# Adjust vjust and size as needed
  labs( x = "",
       y = "Percentage Contribution") +
  theme_clean() +
  ylim(0, 100)+
  coord_flip()+
  theme(axis.text = element_text(size = 14),
         axis.title = element_text(size = 12))

3.1.3.2.3 Attributing Credit to Countries over time
Code
# Step 1: Sum the contribution fraction for each country per slug, per year
country_contribution_by_year <- user_commits_total %>%
  group_by(slug, country_final, year_created) %>%
  summarise(total_contribution_fraction = sum(contribution_fraction_loc, na.rm =  T), .groups = 'drop')

# Step 2: Aggregate the contribution fraction for each country by year
country_aggregated_by_year <- country_contribution_by_year %>%
  group_by(country_final, year_created) %>%
  summarise(overall_contribution_fraction = sum(total_contribution_fraction), .groups = 'drop')

# Step 3: Exclude 'Unknown' (the top ten countries per year are selected in Step 5)
country_aggregated_by_year_excl_unknown <- country_aggregated_by_year %>%
  filter(country_final != "Unknown")

# Step 4: Calculate the total overall contribution by year, excluding 'Unknown'
total_overall_contribution_by_year_excl_unknown <- country_aggregated_by_year_excl_unknown %>%
  group_by(year_created) %>%
  summarise(yearly_total_excl_unknown = sum(overall_contribution_fraction), .groups = 'drop')

# Now compute the percentage of contribution for each of the top countries, excluding 'Unknown'
country_aggregated_by_year_excl_unknown <- country_aggregated_by_year_excl_unknown %>%
  left_join(total_overall_contribution_by_year_excl_unknown, by = "year_created") %>%
  mutate(percentage_contribution_excl_unknown = (overall_contribution_fraction / yearly_total_excl_unknown) * 100) %>%
  arrange(year_created, desc(percentage_contribution_excl_unknown))

# Step 5: Get the top ten countries by year, excluding 'Unknown'
top_countries_by_year_excl_unknown <- country_aggregated_by_year_excl_unknown %>%
  group_by(year_created) %>%
  top_n(10, wt = percentage_contribution_excl_unknown) %>%
  ungroup()

# Drop missing years and 2023 for plotting ('Unknown' was already excluded in Step 3)
plot_data2 <- top_countries_by_year_excl_unknown %>%
  filter(year_created != "NA" & year_created != "2023")


plot_data3 <- top_countries_by_year_excl_unknown %>%
  filter(year_created != "NA" & year_created != "2023",
         country_final %in% c("United States", "Germany", "United Kingdom", "France", "Canada", "Australia", "Netherlands", "Switzerland", "Spain", "China"))

The following graph shows fractional credit over time for every country that ranks in the top ten in at least one year.

Code
my_colors <- c("#FF0000", "#00FF00", "#0000FF", "#FFFF00", "#FF00FF", "#00FFFF", "#000000",
               "#800000", "#008000", "#000080", "#808000", "#800080", "#008080", "#808080",
               "#C00000", "#00C000", "#0000C0", "#C0C000", "#C000C0", "#00C0C0",
               "#400000", "#004000", "#000040", "#404000", "#400040") # Define more colors as needed


# Stacked Bar Chart for Yearly Totals
ggplot(plot_data2, aes(x = year_created, y = overall_contribution_fraction, fill = country_final)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = my_colors) +
  labs(x = "", y = "Number of Packages", title = "Weighted Country Contribution by Year", fill = "Country") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom")

The next graph subsets to the top 10 countries that we identified previously.

Code
my_colors <- c("#6B8E23", "#8FBC8F", "#2E8B57", "#4682B4", "#87CEEB",
               "#4169E1", "#B0C4DE", "#D2691E", "#CD853F", "#F4A460")





# Stacked Bar Chart for Yearly Totals
R_Country_time <- ggplot(plot_data3, aes(x = year_created, y = overall_contribution_fraction, fill = country_final)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = my_colors) +
  labs(x = "", y = "Fractional Count of Packages", title = "Top Countries by Fractional Count of Packages", fill = "Country") + # Top Countries by Fractional Count of Packages, y-axis: Fractional Count of Packages
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom")

R_Country_time

Code
## ggsave(filename = "\\\\westat.com\\dfs\\DVSTAT\\Individual Directories\\Askew\\Paper_Data\\New Graphs\\R_Country_time.png", plot = R_Country_time, width = 8, height = 6, dpi = 300)

3.1.3.3 What is the distribution of unique GitHub R contributors by organization?

We also have the organization variable for some users. It is tied to the sector variable: if we were not able to identify a sector, we were also not able to identify an organization.

Code
# Replace NA values in 'organization' with 'Unknown'
user_commits_total$organization[is.na(user_commits_total$organization)] <- "Unknown"
user_commits_total$organization[user_commits_total$organization == "NA"] <- "Unknown"
cran_users_unique$organization[cran_users_unique$organization == "NA"] <- "Unknown"

If we look at the top 10 most frequent organizations for unique R developers on GitHub, Google has the most with 86, followed by NetEase with 57. Only one organization in the top 10 is from a sector other than business or academic (Broad Institute, nonprofit).

Code
cran_users_unique <- cran_users_unique %>%
                          filter(organization != "Unknown")

### sorting to the top 10 most common institutions for distinct GitHub users
top10_Institutions_GitHub_users_unique <- sort(table(cran_users_unique$organization), decreasing = T)


top10_Institutions_GitHub_users_unique <- as.data.frame(head(top10_Institutions_GitHub_users_unique, 10))

colnames(top10_Institutions_GitHub_users_unique) <- c("organization", "Freq")

### joining to institution unique dataframe to get sector variable
top10_Institutions_GitHub_users_unique <- cran_users_unique %>% 
  right_join(top10_Institutions_GitHub_users_unique, by = "organization")%>%
  distinct(organization, .keep_all = T)%>%
  select(organization, sector, Freq)%>%
  arrange(desc(Freq))

 ### Graph output of top 10 institutions for unique maintainers
 ggplot(top10_Institutions_GitHub_users_unique, aes(x = reorder(organization, Freq), y = Freq, fill = sector))+
   geom_bar(stat = "identity") +
    coord_flip() +
    scale_y_continuous(expand = c(0,0), limits = c(0, 200)) +  # single y scale; a separate ylim() would replace it and discard expand
    labs(x = "", y = "Number of GitHub Users",
         title = "Top 10 Organizations for Unique R Users on GitHub" ) +
  scale_fill_westat(option = "BLUES", drop = FALSE)+
  theme_clean()+
   theme(
  plot.title = element_text(size = 13))+
   labs(caption = "*Those without org info are removed in this figure (82% of 14,328 R Developers)")

Code
### Table output of top 10 institutions for unique GitHub users
top10_Institutions_GitHub_users_unique %>%
  kbl(caption = "Most Frequent Institutions for R Developers on GitHub", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box()
Most Frequent Institutions for R Developers on GitHub
organization                                sector      Freq
Google                                      Business      86
NetEase                                     Business      57
University of California-Berkeley           Academic      46
Broad Institute                             Nonprofit     45
University of Michigan-Ann Arbor            Academic      41
University of Washington-Seattle Campus     Academic      40
Harvard University                          Academic      40
Microsoft                                   Business      40
RStudio                                     Business      33
Smith College                               Academic      32

3.1.3.3.1 How do we attribute contribution to organization (equal)?

We now look at equal contribution for organizations.

Code
# Sum the contribution fraction for each organization per slug.
org_contribution <- user_commits_total %>%
  group_by(slug, organization) %>%
  summarise(total_contribution_fraction = sum(contribution_fraction_equal))

# Aggregate the contribution fraction for organization across all slugs
org_aggregated <- org_contribution %>%
  group_by(organization) %>%
  summarise(overall_contribution_fraction = sum(total_contribution_fraction))

# Calculate the total overall contribution fraction over all organizations
total_overall_contribution = sum(org_aggregated$overall_contribution_fraction)

# Calculate the percentage contribution for each organization
org_aggregated = org_aggregated %>%
  mutate(percentage_contribution = round((overall_contribution_fraction / total_overall_contribution) * 100, 1))

### Create percentage labels for plotting
org_aggregated$percentage_label <- scales::percent(org_aggregated$percentage_contribution / 100)

Again, the equal percentage contribution to Unknown is 77.7%, just as we saw in the sector contribution section. Of the known share (611 different organizations), RStudio leads with 7.2%, followed by UCLA with 3%.

Code
total_excluding_unknown <- sum(org_aggregated$overall_contribution_fraction[org_aggregated$organization != "Unknown"])

org_aggregated <- org_aggregated %>%
  mutate(percentage_contribution_excl_unknown = ifelse(organization != "Unknown", 
                                                       round((overall_contribution_fraction / total_excluding_unknown) * 100, 1), NA_real_))


org_aggregated$percentage_label_excl_unknown <- scales::percent(org_aggregated$percentage_contribution_excl_unknown / 100, accuracy = 0.1)


top_10_orgs <- org_aggregated %>%
  arrange(desc(percentage_contribution_excl_unknown)) %>%
  head(10)

ggplot(top_10_orgs, aes(x = reorder(organization, percentage_contribution_excl_unknown),  y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = .5, size = 4, hjust = -.25) +
    geom_text(aes(label = paste0("(", round(overall_contribution_fraction, 2), ")")), position = position_dodge(width = 0.9), vjust = .25, hjust = -1)+# Adjust vjust and size as needed
  labs(title = "Percentage Contribution by Organization (Equal - Top 10 Organizations)",
       x = "Organization",
       y = "Percentage Contribution") +
  theme_clean() +
  ylim(0,100)+
  coord_flip()+
  theme(plot.title = element_text(size = 7))+
  labs(caption = "*Excludes the percentage contribution from unknown organizations (77.7%)")

3.1.3.3.2 How do we attribute contribution to organizations based on lines of code?

Now, we base the contribution on additions for organization just as we did for sector and country.

Code
org_addition_contribution <- user_commits_total %>%
  group_by(slug, organization) %>%
  summarise(total_addition_contribution = sum(contribution_fraction_loc))

org_aggregated_additions <- org_addition_contribution %>%
  group_by(organization) %>%
  summarise(overall_addition_contribution = sum(total_addition_contribution, na.rm = TRUE))

total_overall_additions = sum(org_aggregated_additions$overall_addition_contribution)

org_aggregated_additions$percentage_additions = round((org_aggregated_additions$overall_addition_contribution / total_overall_additions) * 100,1)

org_aggregated_additions$percentage_label_additions = scales::percent(org_aggregated_additions$percentage_additions / 100) 

Based on additions, the percentage contribution attributed to Unknown is 75.6%, just as we saw for sector. This is expected because the two variables coincide. The percentage coming from RStudio decreases to 5.9% (still number one), and both the membership and the order of the top 10 change slightly. Notably, Monash University moves from the 10th position to the 4th position when additions are factored in.

Code
total_excluding_unknown <- sum(org_aggregated_additions$overall_addition_contribution[org_aggregated_additions$organization != "Unknown"])

org_aggregated_additions <- org_aggregated_additions %>%
  mutate(percentage_contribution_excl_unknown = ifelse(organization != "Unknown", 
                                                       round((overall_addition_contribution / total_excluding_unknown) * 100, 1), NA_real_))


org_aggregated_additions$percentage_label_excl_unknown <- scales::percent(org_aggregated_additions$percentage_contribution_excl_unknown / 100, accuracy = 0.1)


top_10_orgs_additions <- org_aggregated_additions %>%
  arrange(desc(percentage_contribution_excl_unknown)) %>%
  head(10)

ggplot(top_10_orgs_additions, aes(x = reorder(organization, percentage_contribution_excl_unknown),  y = percentage_contribution_excl_unknown)) +
  geom_bar(stat = "identity", fill = westat_blue()) +
  geom_text(aes(label = percentage_label_excl_unknown), vjust = .5, size = 4, hjust = -.25) +
    geom_text(aes(label = paste0("(", round(overall_addition_contribution, 2), ")")), position = position_dodge(width = 0.9), vjust = .25, hjust = -1)+# Adjust vjust and size as needed
  labs(title = "Percentage Contribution by Organization (Weighted - Top 10 Organizations)",
       x = "Organization",
       y = "Percentage Contribution") +
  theme_clean() +
  ylim(0,100)+
  coord_flip()+
  theme(plot.title = element_text(size = 7))+
  labs(caption = "*Excludes the percentage contribution from unknown (75.6%)")

3.2 Network Analysis

We create edge lists for countries and sectors in the following section.

3.2.1 Countries

  • What are the overall structural features of the OSS networks? How do they differ across fields, sectors, institutions, and countries? Units of analysis (OSS actors): projects, categories, developers, institutions, sectors, countries

  • What are the different communities that can be identified using structural features of the networks? Do they correspond to similarities in languages, methods, location, culture?

Code
### select dependency information for slugs and packages
cran_github_rdi <- cran_github %>%
                      select(Package, slug, Depends)

### rename columns
colnames(cran_github_rdi) <- c("Citing_Package", "slug", "Dependencies")


### Package citation column will be the unlisted dependencies column
cran_github_rdi$Package_Citation <- cran_github_rdi$Dependencies


### join commits information for the citing packages
cran_github_RDI <- cran_github_rdi %>%
                      inner_join(user_commits_total, by = "slug")%>%
                        select(Citing_Package, slug, Dependencies, login,
                              country_final, total_additions, total_code_for_slug,
                              contribution_fraction_loc, Package_Citation) %>%
                       # Remove rows with NA in Depends
                        filter(!is.na(Package_Citation))

### rename columns on the basis of the citing package
colnames(cran_github_RDI) <- c("Citing_Package", "Citing_Slug", "Dependencies", "Citing_Login",  "Citing_Country",
                                "Citing_Additions", "Citing_Total_Slug_Additions", "Citing_Package_Fraction" , "Package_Citation")


### unlist the dependencies for joining
cran_github_RDI_network <-  cran_github_RDI %>%
  separate_rows(Package_Citation, sep = ",\\s*") %>%
  filter(Package_Citation != "")


#### prepare commits information for cited packages
user_commits_rdi <- user_commits_total %>%
  mutate(Package_Citation = str_split(slug, "/", simplify = TRUE)[, 2])%>%
  select(login, country_final, total_additions, total_code_for_slug, contribution_fraction_loc, Package_Citation)
  
  colnames(user_commits_rdi) <- c( "Cited_Login", "Cited_Country", 
                                   "Cited_Additions", "Cited_Total_Slug_Additions", "Cited_Package_Fraction", "Package_Citation")
  
  ### join cited package commit information to citing package dataframe
  cran_github_rdi_full <- cran_github_RDI_network %>%
                                        inner_join(user_commits_rdi, by = "Package_Citation")

  ### create dependency_fraction = citing package fraction multiplied by cited package fraction 
  cran_github_rdi_grouped <- cran_github_rdi_full %>%
  mutate(Dependency_Fraction = Citing_Package_Fraction * Cited_Package_Fraction)
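
As a sanity check on the weighting, the sketch below builds one dependency link between two made-up packages. The package names, countries, and fractions are hypothetical; `merge()` stands in for the `inner_join()` used above. It assumes each package's lines-of-code fractions sum to 1, in which case the products for a single link also sum to 1.

```r
# Hypothetical contributors: package "pkgA" depends on package "pkgB".
citing <- data.frame(
  Citing_Country = c("United States", "Germany"),
  Citing_Package_Fraction = c(0.6, 0.4),   # pkgA credit, sums to 1
  Package_Citation = "pkgB"
)
cited <- data.frame(
  Cited_Country = c("France", "Unknown"),
  Cited_Package_Fraction = c(0.75, 0.25),  # pkgB credit, sums to 1
  Package_Citation = "pkgB"
)

# Every citing contributor is paired with every cited contributor.
pairs <- merge(citing, cited, by = "Package_Citation")
pairs$Dependency_Fraction <- pairs$Citing_Package_Fraction * pairs$Cited_Package_Fraction

pairs[, c("Citing_Country", "Cited_Country", "Dependency_Fraction")]
sum(pairs$Dependency_Fraction)
```

Under that assumption, each dependency link contributes a total weight of 1 spread over country pairs, which would be consistent with the overall sum printed below being a whole number.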
Code
# Group by Cited Country and Citing Country, and sum Dependency_Fraction

### the number of citations made from one country to another is simply the sum of the fractioned scores associated with each pair, with the sum across all possible pairs adding up to the total number of citations made at the world level.

dependency_summary <- cran_github_rdi_grouped %>%
  group_by(Cited_Country, Citing_Country) %>%
  summarize(Total_Dependency_Fraction = sum(Dependency_Fraction, na.rm = TRUE))

sum(dependency_summary$Total_Dependency_Fraction)
[1] 589
Code
# Group by Cited Country and sum Total_Dependency_Fraction - total number of citations attributed to each country
citations_by_country <- dependency_summary %>%
  group_by(Cited_Country) %>%
  summarize(Fraction_of_Citations = round(sum(Total_Dependency_Fraction, na.rm = TRUE), 4))


sum(citations_by_country$Fraction_of_Citations)
[1] 589.0001
Code
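# Denominator of the RDI: each cited country's share of all citations.
# Note: shares are rounded to 4 decimals, so a share below 0.00005 becomes 0
# and any RDI computed from it is Inf (or NaN).
citations_by_country$Denominator_RDI <- round(citations_by_country$Fraction_of_Citations / sum(citations_by_country$Fraction_of_Citations),4)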

# Group by citing country and sum Total_Dependency_Fraction - total number of citations made by each country
citings_by_country <- dependency_summary %>%
  group_by(Citing_Country) %>%
  summarize(Fraction_of_Citings = round(sum(Total_Dependency_Fraction, na.rm = TRUE), 4))


sum(citings_by_country$Fraction_of_Citings)
[1] 588.9999
Code
# Join the citing-country totals to the country-pair weights

citings_dependency_summary <- citings_by_country %>%
                                full_join(dependency_summary, by = "Citing_Country")

# Numerator of the RDI: share of the citing country's citations that go to each cited country
citings_dependency_summary$Numerator_RDI <- round(citings_dependency_summary$Total_Dependency_Fraction / citings_dependency_summary$Fraction_of_Citings,4)

## Join Denominator_RDI (the cited country's share of all citations)

citations_citings_dependency_summary <- citations_by_country %>%
                                full_join(citings_dependency_summary, by = "Cited_Country") %>%
                                select(Citing_Country, Cited_Country, Numerator_RDI, Denominator_RDI)

# RDI = Numerator_RDI / Denominator_RDI
citations_citings_dependency_summary$RDI <- round(citations_citings_dependency_summary$Numerator_RDI / citations_citings_dependency_summary$Denominator_RDI,4)
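To make the ratio concrete, the following is a minimal sketch of the same calculation on a hypothetical two-country table. Countries `A` and `B` and their weights are invented for illustration, and the intermediate shares are left unrounded here, which is an assumption that differs from the code above. An RDI above 1 means the citing country sends a larger share of its citations to the cited country than that country receives from the world as a whole.

```r
library(dplyr)

# Hypothetical country-pair weights (not taken from the analysis data).
toy_summary <- data.frame(
  Cited_Country  = c("A", "A", "B", "B"),
  Citing_Country = c("A", "B", "A", "B"),
  Total_Dependency_Fraction = c(6, 2, 1, 1)
)

toy_rdi <- toy_summary %>%
  # Denominator: the cited country's share of all citations
  group_by(Cited_Country) %>%
  mutate(Cited_Total = sum(Total_Dependency_Fraction)) %>%
  ungroup() %>%
  mutate(Denominator_RDI = Cited_Total / sum(Total_Dependency_Fraction)) %>%
  # Numerator: share of the citing country's citations going to the cited country
  group_by(Citing_Country) %>%
  mutate(Numerator_RDI = Total_Dependency_Fraction / sum(Total_Dependency_Fraction)) %>%
  ungroup() %>%
  mutate(RDI = Numerator_RDI / Denominator_RDI)
```

In this toy table, country `A` receives 8 of the 10 citations overall (denominator 0.8) and 6 of the 7 citations made by `A` itself, so the `A`-to-`A` RDI is (6/7) / 0.8, slightly above 1.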
Code
dependency_summary %>%
  arrange(desc(Total_Dependency_Fraction))%>%
  kbl(caption = "Country Pair Dependency Weights", escape = F)%>%
  kable_classic()%>%
  kable_styling(font_size = 12, full_width = T)%>%
 row_spec(0, bold = T, background = westat_blue(), color = "white")%>%
  column_spec(1:2, border_right = T)%>%
  scroll_box(width = "100%", height = "500px")
Country Pair Dependency Weights
Cited_Country Citing_Country Total_Dependency_Fraction
Unknown Unknown 87.3146359
United States Unknown 64.3139658
United States United States 35.7966513
Unknown United States 26.3714621
France Unknown 14.4800845
Unknown Germany 12.1686281
Unknown United Kingdom 11.7170320
Germany Unknown 10.3942012
Norway Norway 8.2998583
United States Germany 7.9622908
Germany Germany 7.6496787
Unknown Netherlands 6.8692472
United States Spain 6.8043969
Denmark Unknown 6.7001129
Unknown Canada 6.3906341
Canada Unknown 6.3692538
United States Italy 6.1850725
United States Australia 6.1608042
Unknown France 5.9665531
Unknown Australia 5.7756594
Bulgaria Unknown 5.2967973
United States United Kingdom 5.1454252
United States France 4.8013768
Germany United States 4.7021219
Unknown New Zealand 4.4227168
Unknown Belgium 4.1871672
United Kingdom Unknown 4.1108012
Netherlands Unknown 3.6034512
Austria United States 3.5561262
Unknown Italy 3.3489010
Denmark Denmark 3.1556103
France United States 3.1419765
Unknown Poland 3.0670811
Germany Canada 2.8483983
Denmark United States 2.6590604
Australia Unknown 2.6535606
Unknown Brazil 2.6226700
United States Netherlands 2.6049659
Australia Australia 2.5753766
United States Canada 2.5707256
Canada Canada 2.5509336
Germany Australia 2.5390527
Norway Unknown 2.4723814
Denmark Germany 2.3466958
France France 2.1811472
United States Switzerland 2.1505801
United States South Korea 2.0812051
Unknown Ireland 2.0721542
Unknown Denmark 2.0536379
United States New Zealand 2.0487277
United States Ireland 2.0308880
United Kingdom Netherlands 2.0298811
United States Brazil 2.0081528
Italy Italy 2.0003613
Hong Kong Hong Kong 2.0000000
Romania Romania 2.0000000
Unknown Sweden 1.9932171
Australia United States 1.9867464
Israel Unknown 1.9366085
United States Denmark 1.9037860
Canada United States 1.8759120
Germany Spain 1.8035616
Germany United Kingdom 1.7853307
Colombia Unknown 1.7366579
Netherlands Netherlands 1.7265773
United States Peru 1.6861346
Netherlands Germany 1.6542048
United Kingdom United States 1.6175734
Norway United States 1.5929450
Netherlands United States 1.5654457
Germany Netherlands 1.5589904
Germany Israel 1.5411524
Spain Ecuador 1.4848368
United States Belgium 1.4739769
United States China 1.4634399
United Kingdom United Kingdom 1.4321661
Denmark United Kingdom 1.4135857
Unknown Colombia 1.3794315
United States Mauritius 1.3733200
Canada United Kingdom 1.3526687
United States Romania 1.3411052
Unknown Spain 1.3233865
Germany Norway 1.3159755
France Netherlands 1.3057481
United States Austria 1.2618838
Canada New Zealand 1.2552483
France Spain 1.2341293
Bulgaria Denmark 1.2013438
United States Czech Republic 1.1903569
Unknown Peru 1.1818778
Spain Unknown 1.1622256
Japan Unknown 1.1621650
Spain Italy 1.1597043
Denmark Netherlands 1.1553843
United States Taiwan 1.1521244
Sweden United States 1.1132434
Switzerland United Kingdom 1.0602851
France United Kingdom 1.0338293
Unknown Switzerland 1.0211730
China Unknown 1.0137011
Brazil Unknown 1.0059937
New Zealand France 1.0010112
Denmark Singapore 1.0000000
Russia Russia 1.0000000
Israel France 0.9998031
Spain Norway 0.9992790
Unknown Fiji 0.9977071
Israel United States 0.9972796
Netherlands New Caledonia 0.9862442
Austria Poland 0.9847822
Denmark Poland 0.9838720
Italy Australia 0.9748095
Switzerland Unknown 0.9725535
Unknown Austria 0.9647271
France New Caledonia 0.9581810
France Hong Kong 0.9545580
Finland Unknown 0.9527975
Poland France 0.9423647
Unknown Portugal 0.9380079
United States Nigeria 0.9345863
Unknown Czech Republic 0.9224215
Netherlands China 0.9220582
Germany Colombia 0.8995402
Germany Italy 0.8971911
France Germany 0.8920140
Belgium Canada 0.8573285
Canada Sweden 0.8527621
United States Israel 0.8506106
Italy Ecuador 0.8386308
Canada Australia 0.7810893
Bulgaria United States 0.7757486
Germany Czech Republic 0.7691053
Denmark Australia 0.7651278
France Canada 0.7521236
Greece Unknown 0.7242482
Unknown Norway 0.7230468
Lithuania Switzerland 0.7128352
Australia Russia 0.7117685
United States Lithuania 0.6849539
Germany South Korea 0.6785448
United States Poland 0.6559873
Germany France 0.6452659
Sweden Unknown 0.6360338
Unknown Singapore 0.6316281
Bulgaria Poland 0.6177956
Germany Chile 0.6159934
Bulgaria Austria 0.6148564
Australia Spain 0.6140359
Norway Spain 0.5967587
Norway Canada 0.5853739
Unknown South Korea 0.5825402
Bulgaria France 0.5774379
France Belgium 0.5595327
Canada Germany 0.5285778
Unknown Curaçao 0.5121700
Austria Unknown 0.5114360
United States Sweden 0.5072172
Netherlands Peru 0.5069443
France Ireland 0.4821894
Germany Switzerland 0.4633026
Finland Germany 0.4559689
Finland Norway 0.4482860
United States Colombia 0.4478594
United States Norway 0.4472529
United States New Caledonia 0.4326497
United States Uruguay 0.4324207
Australia Denmark 0.4310748
United States Panama 0.4309000
Japan United States 0.4096919
Unknown China 0.4071286
France Colombia 0.4012124
Norway Australia 0.3983430
Colombia Germany 0.3970159
France Switzerland 0.3871114
France Peru 0.3841060
Bulgaria Canada 0.3784731
Switzerland United States 0.3773445
Canada Netherlands 0.3664586
Unknown Saudi Arabia 0.3570105
Bulgaria Netherlands 0.3368614
Greece Netherlands 0.3339926
Denmark New Zealand 0.3323449
Bulgaria United Kingdom 0.3290682
Norway South Korea 0.3219288
Spain United States 0.3163808
France Romania 0.3092703
Netherlands Ireland 0.3083117
Sweden Denmark 0.3068182
Australia Israel 0.2985812
Netherlands Lithuania 0.2962873
Canada Italy 0.2948238
Switzerland Italy 0.2919480
Unknown New Caledonia 0.2887299
Canada Colombia 0.2822796
Unknown Uruguay 0.2717018
Unknown Panama 0.2707463
Netherlands Switzerland 0.2706131
Unknown Taiwan 0.2691071
France Mauritius 0.2640198
Switzerland Netherlands 0.2574750
Bulgaria China 0.2567134
Bulgaria Taiwan 0.2526643
Sweden Sweden 0.2498985
Denmark Canada 0.2355911
Unknown Mauritius 0.2304751
Unknown Romania 0.2230253
Denmark Italy 0.2197927
Colombia Spain 0.2175700
Denmark Austria 0.2122025
France Brazil 0.2111520
Unknown Greece 0.2090994
Denmark Norway 0.2075330
Bulgaria Czech Republic 0.2022632
Denmark China 0.2005919
Australia New Caledonia 0.1962969
Canada Czech Republic 0.1952466
Japan Australia 0.1880885
Denmark Peru 0.1878818
Switzerland Switzerland 0.1844749
Poland Unknown 0.1843025
Canada Norway 0.1813057
Denmark Belgium 0.1758419
Denmark Brazil 0.1752140
United Kingdom Italy 0.1630648
Canada Austria 0.1621032
Netherlands Italy 0.1619164
Canada China 0.1577323
Australia France 0.1505368
France Sweden 0.1497168
Belgium Unknown 0.1493700
Denmark Ireland 0.1483553
France Australia 0.1465289
Canada Peru 0.1457408
Japan Germany 0.1442236
Japan United Kingdom 0.1423299
Canada Denmark 0.1404967
Denmark New Caledonia 0.1374681
Denmark Uruguay 0.1372391
Denmark Panama 0.1367564
Denmark South Korea 0.1365204
Unknown Ecuador 0.1360882
Denmark Taiwan 0.1352712
Canada Brazil 0.1349812
Norway United Kingdom 0.1315918
United States Greece 0.1260161
Colombia Greece 0.1258970
France Italy 0.1192478
Australia New Zealand 0.1169271
Colombia United States 0.1139295
Canada Ireland 0.1109200
Sweden Germany 0.1107721
Colombia United Kingdom 0.1096227
Netherlands Australia 0.1086190
Canada Taiwan 0.1079066
Canada Poland 0.1071618
Canada South Korea 0.1052102
Norway Germany 0.1049158
Canada New Caledonia 0.1047443
Canada Uruguay 0.1047443
Canada Panama 0.1043759
Switzerland Austria 0.1006642
Switzerland Norway 0.1006271
Bulgaria Germany 0.0961149
China United Kingdom 0.0934190
China Australia 0.0928251
China Romania 0.0926876
Australia Netherlands 0.0900973
United States Iran 0.0895339
Colombia Switzerland 0.0866403
Lithuania Unknown 0.0850502
Switzerland Canada 0.0831040
Norway Italy 0.0777531
Norway Czech Republic 0.0776187
Norway Israel 0.0775624
Switzerland Germany 0.0753103
Denmark Switzerland 0.0744288
Switzerland Israel 0.0671474
Japan Netherlands 0.0670224
Bulgaria Italy 0.0664073
Unknown Israel 0.0642599
Netherlands United Kingdom 0.0642527
Canada Belgium 0.0635688
Norway Chile 0.0623432
United States Portugal 0.0618976
Belgium United States 0.0613799
Switzerland Australia 0.0612307
Switzerland Czech Republic 0.0606923
Canada Switzerland 0.0589714
Germany New Zealand 0.0584768
Belgium Australia 0.0569835
Switzerland Saudi Arabia 0.0568921
Sweden Greece 0.0561070
Norway France 0.0558811
France Denmark 0.0556274
France Iran 0.0512067
Unknown Russia 0.0505592
Japan Austria 0.0505173
Denmark Czech Republic 0.0503192
Australia Italy 0.0489262
Switzerland Chile 0.0485827
Japan China 0.0476934
New Zealand Unknown 0.0469499
Japan Peru 0.0462711
Japan Canada 0.0448302
Switzerland France 0.0445385
Canada France 0.0441761
Colombia Brazil 0.0440714
Japan Denmark 0.0439129
Japan Brazil 0.0425713
United States Russia 0.0414637
United Kingdom France 0.0409140
Japan New Zealand 0.0388648
Italy Unknown 0.0388300
Unknown Iran 0.0377344
France New Zealand 0.0372241
Japan Italy 0.0370687
Colombia Netherlands 0.0366625
Germany Sweden 0.0362615
Canada Israel 0.0358376
Bulgaria Spain 0.0342123
Japan Norway 0.0338562
Japan South Korea 0.0333496
Japan Poland 0.0333440
Japan Uruguay 0.0326517
Japan New Caledonia 0.0326517
Japan Panama 0.0325369
Japan Taiwan 0.0321835
Germany Belgium 0.0319093
France Argentina 0.0308185
Poland Denmark 0.0294982
United Kingdom Singapore 0.0294045
Bulgaria Nigeria 0.0289694
United Kingdom New Caledonia 0.0286144
Unknown Croatia 0.0280926
Canada Chile 0.0280021
Australia Norway 0.0262178
Spain France 0.0262091
Poland United States 0.0252270
United Kingdom Hong Kong 0.0249606
Spain Spain 0.0243645
United Kingdom Canada 0.0240402
Australia Germany 0.0237971
Japan Ireland 0.0237783
Netherlands Brazil 0.0237204
Germany Peru 0.0223014
Netherlands Austria 0.0221971
Australia United Kingdom 0.0214205
Japan Belgium 0.0208931
Bulgaria Belgium 0.0202719
Netherlands Canada 0.0199530
New Zealand United States 0.0198287
Netherlands South Korea 0.0196425
Unknown Vietnam 0.0193056
Netherlands Denmark 0.0192819
United Kingdom Australia 0.0189649
United Kingdom Germany 0.0182602
Austria Australia 0.0179254
Poland Israel 0.0177945
France South Korea 0.0175913
Germany Romania 0.0167968
United States Curaçao 0.0167415
United Kingdom Norway 0.0167285
Unknown Hong Kong 0.0165701
Finland United States 0.0165472
Netherlands Norway 0.0165328
Japan Switzerland 0.0164776
Bulgaria Ireland 0.0164602
Belgium Germany 0.0161245
Bulgaria Peru 0.0159461
Taiwan Unknown 0.0158341
South Korea Sweden 0.0158018
Australia Czech Republic 0.0157652
New Zealand Australia 0.0152377
Poland Poland 0.0151597
Poland Austria 0.0151359
Austria United Kingdom 0.0151351
Italy Spain 0.0148336
Germany Mauritius 0.0146504
Netherlands Poland 0.0146380
Italy Colombia 0.0143472
Netherlands New Zealand 0.0143121
Netherlands Uruguay 0.0143048
Netherlands Panama 0.0142545
Netherlands Taiwan 0.0142112
United Kingdom Ireland 0.0142092
Netherlands Belgium 0.0139892
United States South Africa 0.0134590
Germany Saudi Arabia 0.0133996
Poland Netherlands 0.0129269
Germany Poland 0.0128339
Taiwan United States 0.0125941
Spain Poland 0.0125781
Australia Chile 0.0125428
Taiwan Australia 0.0119497
United States Hungary 0.0119127
Greece United States 0.0117563
Denmark France 0.0117198
Colombia Australia 0.0115513
Japan Czech Republic 0.0114539
Chile Unknown 0.0114306
Greece Germany 0.0113962
Poland Czech Republic 0.0112703
Unknown Chile 0.0110268
United Kingdom Colombia 0.0104980
Bulgaria Australia 0.0104409
United Kingdom Czech Republic 0.0102863
United Kingdom Israel 0.0102854
Bulgaria New Zealand 0.0102791
Poland Canada 0.0101044
Colombia New Zealand 0.0100556
Spain United Kingdom 0.0096010
Australia Nigeria 0.0092871
United States Saudi Arabia 0.0092137
Colombia Denmark 0.0090860
Bulgaria Lithuania 0.0090474
Germany Brazil 0.0087518
United Kingdom Spain 0.0082342
Poland United Kingdom 0.0081089
Spain Brazil 0.0080278
Bulgaria Switzerland 0.0079930
Unknown Nigeria 0.0079273
United Kingdom Chile 0.0078933
Bulgaria Romania 0.0073016
Bulgaria Mauritius 0.0069345
Unknown South Africa 0.0069098
Unknown Finland 0.0067922
Spain South Korea 0.0066939
France Poland 0.0066835
Finland Belgium 0.0065901
Poland Italy 0.0064869
Norway Saudi Arabia 0.0063399
Ukraine Unknown 0.0063074
Singapore Australia 0.0060268
France Austria 0.0058384
France Norway 0.0056087
Denmark Nigeria 0.0055300
United States Chile 0.0055177
Denmark Israel 0.0055002
Germany Denmark 0.0054771
Spain Germany 0.0053937
Austria France 0.0052599
New Caledonia Unknown 0.0052364
Netherlands Czech Republic 0.0052323
Italy United States 0.0051895
Denmark Spain 0.0051378
France China 0.0051080
France Israel 0.0048205
Portugal Unknown 0.0046360
Spain Canada 0.0045777
Finland Netherlands 0.0045139
Finland Peru 0.0045076
Denmark Chile 0.0042128
Colombia France 0.0041863
Colombia Sweden 0.0040938
Singapore Unknown 0.0040324
Chile United States 0.0040093
Ukraine Italy 0.0039478
Germany Ecuador 0.0038602
Switzerland Ecuador 0.0038602
Austria Netherlands 0.0037632
United Kingdom Nigeria 0.0036867
Poland Norway 0.0036790
Switzerland Spain 0.0036213
Austria New Caledonia 0.0036085
Austria Hong Kong 0.0036029
Japan France 0.0035910
Poland Germany 0.0035711
Denmark Sweden 0.0035000
Spain Switzerland 0.0034732
France Uruguay 0.0033940
Uganda Unknown 0.0033897
France Panama 0.0033821
France Taiwan 0.0033638
United States Croatia 0.0033457
Italy Norway 0.0033121
United Kingdom South Korea 0.0032662
Sweden Australia 0.0032072
United Kingdom Brazil 0.0031792
Brazil United States 0.0031430
United Kingdom New Zealand 0.0031081
Finland Spain 0.0031065
Finland Romania 0.0031065
Spain Australia 0.0030526
United Arab Emirates Unknown 0.0030249
Belgium Brazil 0.0030008
Spain New Zealand 0.0029926
Finland United Kingdom 0.0029745
Unknown Lithuania 0.0029621
Finland Mauritius 0.0029502
Mexico Unknown 0.0029216
Japan Spain 0.0028533
Argentina Unknown 0.0028179
Germany Iran 0.0028132
Canada Nigeria 0.0027801
Belgium South Korea 0.0027615
Austria Canada 0.0027188
Canada Spain 0.0026913
Kenya Unknown 0.0026790
New Zealand Germany 0.0026715
France Czech Republic 0.0026421
Netherlands France 0.0025869
China United States 0.0025423
United Kingdom Denmark 0.0025162
United States Singapore 0.0025121
Finland France 0.0024718
United Kingdom Sweden 0.0024510
Poland Australia 0.0024258
New Zealand Spain 0.0024141
China Spain 0.0023455
Unknown Japan 0.0023344
New Zealand United Kingdom 0.0023064
Ukraine United States 0.0022998
Spain Ireland 0.0022905
Spain Netherlands 0.0022856
Portugal Belgium 0.0022806
Belgium Peru 0.0022471
Colombia Poland 0.0021876
Netherlands Spain 0.0021545
Italy Germany 0.0020929
New Zealand Ireland 0.0020238
Canada Curaçao 0.0020178
Portugal United States 0.0020133
Netherlands Romania 0.0020044
Switzerland Belgium 0.0019900
Greenland Unknown 0.0019842
Spain Belgium 0.0019662
Antarctica Unknown 0.0018835
Norway Brazil 0.0018735
New Zealand Lithuania 0.0018420
Netherlands Mauritius 0.0018407
Finland Denmark 0.0018355
Chile Australia 0.0018312
South Africa Unknown 0.0018268
Brazil New Zealand 0.0018180
Austria Ireland 0.0017710
Poland Chile 0.0017650
Sweden Netherlands 0.0017490
Japan Nigeria 0.0017124
Italy Israel 0.0016718
Sweden United Kingdom 0.0016704
Denmark Croatia 0.0016311
Antarctica United States 0.0016147
Belgium New Caledonia 0.0016039
India Mauritius 0.0015996
Bulgaria Brazil 0.0015963
Uruguay Canada 0.0015718
Germany Austria 0.0015611
Sweden Spain 0.0015609
Spain China 0.0015585
Sweden France 0.0015392
Australia Switzerland 0.0015271
United Kingdom Austria 0.0015167
Austria Colombia 0.0015143
Portugal Netherlands 0.0015132
Spain Taiwan 0.0015128
United Arab Emirates United States 0.0015084
United Arab Emirates Australia 0.0014849
Bulgaria Sweden 0.0014771
Austria Germany 0.0014466
Sweden Ireland 0.0014365
Chile United Kingdom 0.0014351
Belgium Spain 0.0014311
Chile Germany 0.0014294
Brazil United Kingdom 0.0014177
Sweden Saudi Arabia 0.0014128
Germany Ireland 0.0014026
China Germany 0.0013954
Finland Australia 0.0013888
Spain Peru 0.0013796
United Kingdom Poland 0.0013763
Bulgaria Iran 0.0013449
Bulgaria South Korea 0.0013220
United Kingdom Peru 0.0013107
China Denmark 0.0012872
Finland Canada 0.0012629
New Zealand New Zealand 0.0012489
Australia Peru 0.0012300
Colombia Belgium 0.0012251
Austria Belgium 0.0012134
Sweden Switzerland 0.0011991
Japan Romania 0.0011935
Uganda United States 0.0011899
New Zealand Italy 0.0011746
Singapore Germany 0.0011616
Japan Israel 0.0011439
Italy Czech Republic 0.0011382
Germany Lithuania 0.0011377
Japan Mauritius 0.0011335
New Zealand Norway 0.0010773
France Chile 0.0010677
Brazil Australia 0.0010652
China Switzerland 0.0010574
China Netherlands 0.0010487
Ukraine Germany 0.0010473
Brazil Germany 0.0010455
New Zealand Switzerland 0.0010339
United Kingdom Switzerland 0.0010188
Japan Sweden 0.0010038
India Unknown 0.0009953
Italy France 0.0009951
Mexico United States 0.0009933
Turkey Unknown 0.0009694
Finland Austria 0.0009617
Argentina United States 0.0009523
Canada South Africa 0.0009517
Brazil France 0.0009503
Brazil Ireland 0.0009488
Kenya United States 0.0009271
Ukraine Australia 0.0009261
Finland Poland 0.0009261
Belgium United Kingdom 0.0009257
New Zealand Peru 0.0009190
China France 0.0009143
Taiwan Spain 0.0009137
Italy Chile 0.0009114
India United States 0.0008787
Italy Brazil 0.0008740
New Zealand Netherlands 0.0008702
Colombia Peru 0.0008680
Norway Netherlands 0.0008580
Norway Peru 0.0008514
Germany Mexico 0.0008489
United Kingdom China 0.0008407
Colombia Canada 0.0008375
Austria Peru 0.0008286
Bulgaria Norway 0.0008100
Bulgaria New Caledonia 0.0008099
Bulgaria Uruguay 0.0008099
Bulgaria Panama 0.0008070
United Kingdom Argentina 0.0008059
New Zealand Brazil 0.0007711
United Kingdom Russia 0.0007700
Germany New Caledonia 0.0007478
Australia Brazil 0.0007444
Ukraine United Kingdom 0.0007331
New Zealand Israel 0.0007185
Greenland United States 0.0006965
Italy South Korea 0.0006902
New Zealand Canada 0.0006897
Czech Republic Unknown 0.0006842
Chile Netherlands 0.0006695
Canada Romania 0.0006688
Colombia Romania 0.0006599
Netherlands Sweden 0.0006440
United Kingdom Taiwan 0.0006393
Sweden Russia 0.0006311
Switzerland Nigeria 0.0006245
China Poland 0.0006180
Switzerland Brazil 0.0006172
China Austria 0.0006165
Italy Switzerland 0.0006101
France Curaçao 0.0006048
Austria Spain 0.0006045
Austria Romania 0.0006045
China Canada 0.0005998
China Italy 0.0005960
New Zealand Czech Republic 0.0005925
United Kingdom Belgium 0.0005866
Japan Chile 0.0005787
Austria Mauritius 0.0005741
Finland Iran 0.0005722
Antarctica France 0.0005699
Israel Sweden 0.0005644
New Caledonia United States 0.0005570
France Saudi Arabia 0.0005523
Uganda Australia 0.0005465
Unknown Argentina 0.0005420
Mexico Australia 0.0005396
Netherlands Israel 0.0005282
Uzbekistan Unknown 0.0005236
Italy United Kingdom 0.0005235
Antarctica Germany 0.0005186
Antarctica Ireland 0.0005173
Chile Austria 0.0005113
Spain Romania 0.0005106
Antarctica Australia 0.0005023
Belgium Switzerland 0.0004956
United States Vietnam 0.0004873
Spain Mauritius 0.0004849
Chile China 0.0004830
China Brazil 0.0004818
Australia Canada 0.0004706
Chile Peru 0.0004558
Chile Canada 0.0004524
Finland Sweden 0.0004471
Chile Denmark 0.0004428
Austria Sweden 0.0004346
New Zealand Denmark 0.0004333
Argentina Australia 0.0004301
Australia Austria 0.0004301
Netherlands South Africa 0.0004290
Uganda United Kingdom 0.0004276
Kenya Australia 0.0004257
Brazil Switzerland 0.0004257
Australia Singapore 0.0004242
Germany China 0.0004241
Netherlands Chile 0.0004224
Chile Brazil 0.0004217
Uganda Germany 0.0004207
Sweden Belgium 0.0004170
Kenya Italy 0.0004158
Australia Belgium 0.0004111
Australia China 0.0004104
New Zealand Chile 0.0004022
Norway Mexico 0.0004020
Switzerland Peru 0.0003968
Greece Italy 0.0003939
Mexico Germany 0.0003866
United Kingdom Uruguay 0.0003857
New Zealand Austria 0.0003849
United Kingdom Panama 0.0003843
Denmark Colombia 0.0003788
Hungary Unknown 0.0003641
Mexico Mauritius 0.0003635
Taiwan Switzerland 0.0003634
Mexico United Kingdom 0.0003591
New Zealand China 0.0003576
Netherlands Iran 0.0003570
Spain Denmark 0.0003525
Canada Lithuania 0.0003433
Chile Poland 0.0003373
Argentina United Kingdom 0.0003362
Argentina Germany 0.0003359
Chile Norway 0.0003308
Chile New Zealand 0.0003307
Chile New Caledonia 0.0003307
Chile Uruguay 0.0003307
Greece France 0.0003306
Kenya United Kingdom 0.0003302
Chile Panama 0.0003295
United States Bulgaria 0.0003287
Kenya Germany 0.0003286
Chile South Korea 0.0003284
Chile Italy 0.0003263
Sweden Brazil 0.0003262
Chile Taiwan 0.0003259
Ukraine Netherlands 0.0003244
Colombia Ireland 0.0003226
Greenland Australia 0.0003199
China New Caledonia 0.0003148
Sweden Peru 0.0003111
China Hong Kong 0.0003084
Singapore Italy 0.0003035
Germany Russia 0.0003016
Colombia Mauritius 0.0002995
Finland Czech Republic 0.0002985
Belgium New Zealand 0.0002953
Australia Poland 0.0002931
Germany Uruguay 0.0002898
Ukraine Peru 0.0002887
Germany Panama 0.0002887
Australia Taiwan 0.0002886
Germany Taiwan 0.0002857
Colombia Austria 0.0002794
Czech Republic United States 0.0002765
France Greece 0.0002752
Norway Austria 0.0002726
Colombia Italy 0.0002699
South Korea Unknown 0.0002657
New Zealand Romania 0.0002650
Switzerland New Zealand 0.0002595
New Zealand Poland 0.0002587
Australia Uruguay 0.0002552
Norway China 0.0002544
Australia Panama 0.0002543
Australia South Korea 0.0002535
Unknown Hungary 0.0002507
Greenland United Kingdom 0.0002503
Australia Bulgaria 0.0002501
Brazil Italy 0.0002499
China Ireland 0.0002488
Ukraine Austria 0.0002488
Colombia China 0.0002466
Greenland Germany 0.0002463
New Zealand New Caledonia 0.0002448
New Zealand Uruguay 0.0002448
New Zealand Panama 0.0002439
New Zealand South Korea 0.0002431
New Zealand Taiwan 0.0002413
Chile Ireland 0.0002383
Norway Denmark 0.0002368
Italy Netherlands 0.0002359
Bulgaria Curaçao 0.0002340
Switzerland Denmark 0.0002328
Ukraine China 0.0002322
Belgium Italy 0.0002315
Antarctica Switzerland 0.0002276
Colombia Curaçao 0.0002251
Norway Ireland 0.0002247
Belgium France 0.0002243
Japan Iran 0.0002198
Ukraine Canada 0.0002178
Ukraine Denmark 0.0002168
Lithuania Germany 0.0002166
Uruguay Germany 0.0002104
South Korea Italy 0.0002093
Taiwan Germany 0.0002048
Belgium Netherlands 0.0002040
Ukraine Brazil 0.0002026
China Czech Republic 0.0002016
Kosovo Germany 0.0002013
Sweden Romania 0.0002003
Uganda Netherlands 0.0001980
Chile Belgium 0.0001971
Spain Sweden 0.0001940
Portugal Germany 0.0001919
Peru United Kingdom 0.0001906
Sweden Mauritius 0.0001903
Taiwan Brazil 0.0001844
Australia Ireland 0.0001842
Uzbekistan United States 0.0001838
Hungary United States 0.0001833
Colombia South Korea 0.0001818
Norway Poland 0.0001803
Sweden Israel 0.0001802
United States Ecuador 0.0001800
Unknown India 0.0001791
United States Finland 0.0001751
Colombia Norway 0.0001746
Norway New Zealand 0.0001742
Norway New Caledonia 0.0001742
Norway Uruguay 0.0001742
Hungary Australia 0.0001739
Norway Panama 0.0001736
United Kingdom Romania 0.0001729
Uruguay Unknown 0.0001728
Norway Taiwan 0.0001717
Italy Sweden 0.0001707
Finland Curaçao 0.0001692
Colombia New Caledonia 0.0001686
Colombia Uruguay 0.0001686
Colombia Panama 0.0001680
Colombia Taiwan 0.0001661
Czech Republic Norway 0.0001657
Chile Switzerland 0.0001654
Luxembourg Unknown 0.0001654
United Kingdom Mauritius 0.0001642
Mexico Netherlands 0.0001642
Colombia Lithuania 0.0001625
Ukraine Poland 0.0001623
Germany Nigeria 0.0001612
Ukraine Norway 0.0001590
Ukraine New Zealand 0.0001589
Ukraine New Caledonia 0.0001589
Ukraine Uruguay 0.0001589
Ukraine Panama 0.0001584
Ukraine South Korea 0.0001578
Argentina Netherlands 0.0001569
Ukraine Taiwan 0.0001566
Greece Norway 0.0001557
New Zealand Belgium 0.0001537
Kenya Netherlands 0.0001529
Uganda Austria 0.0001526
Singapore France 0.0001518
Belgium China 0.0001514
Singapore Nigeria 0.0001511
Taiwan Netherlands 0.0001504
Belgium Taiwan 0.0001489
Mexico Peru 0.0001473
Belgium Poland 0.0001472
Finland China 0.0001468
Uganda China 0.0001442
Switzerland South Korea 0.0001428
Lithuania United States 0.0001422
Nigeria Italy 0.0001412
Czech Republic Germany 0.0001405
Singapore Spain 0.0001403
United States Turkey 0.0001395
Norway Belgium 0.0001390
Ukraine Belgium 0.0001387
Netherlands Curaçao 0.0001378
Latvia Unknown 0.0001378
Uganda Canada 0.0001347
Switzerland China 0.0001340
Uganda Peru 0.0001340
Belgium Sweden 0.0001336
Mexico Italy 0.0001331
Hong Kong Unknown 0.0001330
Uganda Denmark 0.0001321
Norway Switzerland 0.0001316
Czech Republic Australia 0.0001300
Finland Brazil 0.0001300
China Colombia 0.0001296
Mexico Austria 0.0001266
Uganda Brazil 0.0001259
Switzerland Sweden 0.0001248
Argentina Austria 0.0001236
Australia Romania 0.0001225
Germany South Africa 0.0001198
Mexico China 0.0001196
Denmark South Africa 0.0001190
Kenya Austria 0.0001182
Australia Mauritius 0.0001163
Austria Argentina 0.0001163
Greenland Netherlands 0.0001159
Ukraine Ireland 0.0001148
China Belgium 0.0001133
Turkey China 0.0001127
Argentina China 0.0001126
Argentina Denmark 0.0001119
Mexico Canada 0.0001117
Kenya China 0.0001114
Austria Iran 0.0001113
Turkey Taiwan 0.0001111
Mexico Denmark 0.0001096
Chile Czech Republic 0.0001086
Bulgaria Bulgaria 0.0001084
Argentina Canada 0.0001079
Finland Italy 0.0001050
Argentina Peru 0.0001046
Mexico Brazil 0.0001044
Switzerland Romania 0.0001042
Kenya Canada 0.0001041
Kenya Peru 0.0001035
Poland China 0.0001028
Greece Czech Republic 0.0001022
Kenya Denmark 0.0001020
Czech Republic Italy 0.0001015
Uganda Poland 0.0001007
Finland New Zealand 0.0001003
Finland Uruguay 0.0001003
Finland New Caledonia 0.0001003
Finland Panama 0.0001000
Finland South Korea 0.0000996
Unknown Bulgaria 0.0000995
Switzerland Mauritius 0.0000989
Finland Taiwan 0.0000989
Czech Republic Czech Republic 0.0000989
Uganda Norway 0.0000987
Uganda New Zealand 0.0000987
Uganda New Caledonia 0.0000987
Uganda Uruguay 0.0000987
Uganda Panama 0.0000984
Argentina Brazil 0.0000983
Uganda South Korea 0.0000981
Czech Republic Israel 0.0000976
Uganda Italy 0.0000974
Uganda Taiwan 0.0000973
Kenya Brazil 0.0000972
Switzerland Taiwan 0.0000971
Portugal Peru 0.0000955
Spain Iran 0.0000940
Canada Mauritius 0.0000940
Greece Australia 0.0000939
Greece Israel 0.0000939
Niue Australia 0.0000918
Italy Poland 0.0000917
Italy Portugal 0.0000917
France Bulgaria 0.0000915
Australia South Africa 0.0000907
Denmark Lithuania 0.0000905
Japan Colombia 0.0000901
Switzerland Poland 0.0000900
Greenland Austria 0.0000894
Singapore Colombia 0.0000883
Sweden Canada 0.0000878
Unknown Turkey 0.0000877
South Africa United States 0.0000871
Italy Denmark 0.0000865
Uzbekistan Australia 0.0000844
Greenland China 0.0000844
Mexico Poland 0.0000835
Argentina Poland 0.0000831
Argentina Norway 0.0000831
Kenya Norway 0.0000823
Ukraine Switzerland 0.0000820
Mexico Norway 0.0000819
Mexico New Zealand 0.0000819
Mexico New Caledonia 0.0000819
Mexico Uruguay 0.0000819
Nepal United States 0.0000819
Mexico Panama 0.0000816
Mexico South Korea 0.0000813
Mexico Taiwan 0.0000807
Belgium Nigeria 0.0000806
Spain Austria 0.0000805
Argentina Italy 0.0000796
Poland Romania 0.0000795
Bulgaria Croatia 0.0000792
Greenland Canada 0.0000789
Greenland Peru 0.0000784
Czech Republic Chile 0.0000781
Russia Unknown 0.0000780
Kenya Poland 0.0000778
Greenland Denmark 0.0000773
Argentina New Zealand 0.0000771
Argentina New Caledonia 0.0000771
Argentina Uruguay 0.0000771
Argentina Panama 0.0000768
Argentina South Korea 0.0000765
Kenya New Zealand 0.0000763
Kenya New Caledonia 0.0000763
Kenya Uruguay 0.0000763
Kenya Panama 0.0000760
Argentina Taiwan 0.0000759
Kenya South Korea 0.0000757
Greece Chile 0.0000752
Kenya Taiwan 0.0000752
Switzerland Ireland 0.0000747
China Sweden 0.0000743
Greenland Brazil 0.0000737
Finland Ireland 0.0000723
Egypt Unknown 0.0000723
France Lithuania 0.0000722
Belgium Denmark 0.0000715
Uganda Ireland 0.0000711
Finland Switzerland 0.0000708
Portugal Spain 0.0000704
Portugal Romania 0.0000704
Czech Republic France 0.0000702
Brazil Netherlands 0.0000676
Portugal Mauritius 0.0000669
Uzbekistan United Kingdom 0.0000660
Colombia Czech Republic 0.0000651
Uzbekistan Germany 0.0000650
Hong Kong United States 0.0000642
Australia Sweden 0.0000626
Israel Israel 0.0000616
Switzerland Uruguay 0.0000610
Switzerland New Caledonia 0.0000610
Switzerland Panama 0.0000608
United States Japan 0.0000597
Mexico Ireland 0.0000590
Greenland Poland 0.0000589
Colombia Iran 0.0000581
Belgium Ireland 0.0000581
Luxembourg United States 0.0000580
Greenland Norway 0.0000578
Greenland New Zealand 0.0000578
Greenland New Caledonia 0.0000578
Greenland Uruguay 0.0000578
Greenland Panama 0.0000576
Greenland South Korea 0.0000574
Greenland Italy 0.0000570
Greenland Taiwan 0.0000570
Nepal Unknown 0.0000567
Uganda Belgium 0.0000558
Argentina Ireland 0.0000555
Sweden Poland 0.0000553
Kenya Ireland 0.0000549
United Kingdom Lithuania 0.0000542
Ukraine Czech Republic 0.0000522
Spain New Caledonia 0.0000514
Spain Uruguay 0.0000514
Spain Panama 0.0000512
Uganda Switzerland 0.0000494
Brazil Canada 0.0000492
Belgium Norway 0.0000487
New Zealand Sweden 0.0000486
Latvia United States 0.0000484
Hong Kong Norway 0.0000479
Portugal United Kingdom 0.0000479
China New Zealand 0.0000475
Mexico Belgium 0.0000462
Nigeria Unknown 0.0000460
Ireland Unknown 0.0000457
Israel Australia 0.0000454
Netherlands Colombia 0.0000454
Niue Unknown 0.0000449
Sweden Austria 0.0000447
Denmark Turkey 0.0000443
Finland Finland 0.0000438
Argentina Belgium 0.0000435
Kenya Belgium 0.0000431
Portugal Vietnam 0.0000426
Sweden China 0.0000422
Antarctica Poland 0.0000420
Argentina Switzerland 0.0000420
Greenland Ireland 0.0000416
Brazil Brazil 0.0000414
Mexico Switzerland 0.0000409
Portugal France 0.0000409
Netherlands Nigeria 0.0000403
Costa Rica Unknown 0.0000398
South Africa Germany 0.0000392
Brazil Poland 0.0000389
Kenya Switzerland 0.0000381
Taiwan Denmark 0.0000371
Sweden Iran 0.0000369
Japan Lithuania 0.0000362
China Lithuania 0.0000361
Italy Russia 0.0000357
Niue Russia 0.0000357
Egypt United States 0.0000357
Netherlands Vietnam 0.0000353
Poland Peru 0.0000349
Portugal Australia 0.0000342
Russia Israel 0.0000341
Canada Turkey 0.0000338
Austria Denmark 0.0000337
Finland Singapore 0.0000336
Germany Curaçao 0.0000334
Switzerland Vietnam 0.0000331
Poland Brazil 0.0000328
Greenland Belgium 0.0000326
Uganda Czech Republic 0.0000324
Belgium Czech Republic 0.0000324
United Kingdom Iran 0.0000318
Bulgaria South Africa 0.0000317
Brazil Austria 0.0000312
Uzbekistan Netherlands 0.0000306
Argentina Czech Republic 0.0000304
Antarctica New Zealand 0.0000303
Colombia Nigeria 0.0000302
New Zealand Nigeria 0.0000302
Spain Nigeria 0.0000302
Singapore United States 0.0000299
Brazil Denmark 0.0000298
Brazil China 0.0000293
Brazil South Korea 0.0000291
Hong Kong Germany 0.0000290
Belgium Israel 0.0000289
Hong Kong Australia 0.0000289
Hong Kong Czech Republic 0.0000289
Hong Kong Israel 0.0000289
Hong Kong Italy 0.0000289
Sweden Norway 0.0000289
Greenland Switzerland 0.0000289
Sweden New Zealand 0.0000289
Sweden New Caledonia 0.0000289
Sweden Uruguay 0.0000289
Sweden Italy 0.0000288
Sweden Panama 0.0000288
Brazil Peru 0.0000288
Sweden South Korea 0.0000287
Kenya Czech Republic 0.0000286
Sweden Taiwan 0.0000285
South Africa United Kingdom 0.0000279
Iceland Unknown 0.0000276
Poland South Africa 0.0000269
Mexico Czech Republic 0.0000269
Luxembourg Australia 0.0000267
Brazil Norway 0.0000261
South Africa Belgium 0.0000260
Poland New Zealand 0.0000257
Poland New Caledonia 0.0000257
Poland Uruguay 0.0000257
Spain Greece 0.0000257
Poland Panama 0.0000256
Poland South Korea 0.0000255
Poland Taiwan 0.0000253
Israel United Kingdom 0.0000251
Ukraine South Africa 0.0000240
Unknown Luxembourg 0.0000236
Uzbekistan Austria 0.0000236
Belgium Chile 0.0000231
Hong Kong Chile 0.0000231
India Denmark 0.0000231
South Africa Italy 0.0000226
Australia Iran 0.0000226
Uzbekistan China 0.0000223
Latvia Australia 0.0000222
Cyprus Unknown 0.0000222
Germany Vietnam 0.0000217
South Africa Spain 0.0000217
Australia Fiji 0.0000212
Ukraine Curaçao 0.0000212
Spain Czech Republic 0.0000209
Luxembourg United Kingdom 0.0000209
Colombia Bulgaria 0.0000208
Uzbekistan Canada 0.0000208
Hong Kong France 0.0000208
Uzbekistan Peru 0.0000207
South Africa Netherlands 0.0000206
Ireland Italy 0.0000206
Luxembourg Germany 0.0000205
Uzbekistan Denmark 0.0000204
Israel Denmark 0.0000203
Norway Romania 0.0000203
Brazil New Caledonia 0.0000201
Brazil Uruguay 0.0000201
Brazil Panama 0.0000200
South Africa Peru 0.0000200
Russia Germany 0.0000199
Brazil Taiwan 0.0000198
Uzbekistan Brazil 0.0000194
Antarctica Canada 0.0000194
Egypt Germany 0.0000194
Norway Mauritius 0.0000193
Switzerland Iran 0.0000192
Antarctica United Kingdom 0.0000190
Greenland Czech Republic 0.0000190
Poland Ireland 0.0000185
Canada Iran 0.0000182
Brazil Spain 0.0000181
Ghana Germany 0.0000181
Ghana Peru 0.0000181
Czech Republic United Kingdom 0.0000174
Latvia United Kingdom 0.0000174
Antarctica Netherlands 0.0000172
Italy New Zealand 0.0000171
Latvia Germany 0.0000171
South Africa France 0.0000171
Taiwan Sweden 0.0000170
Cyprus Germany 0.0000161
Portugal Canada 0.0000161
Mexico Russia 0.0000159
Uzbekistan Poland 0.0000156
Uzbekistan Norway 0.0000153
Uzbekistan New Zealand 0.0000153
Uzbekistan New Caledonia 0.0000153
Uzbekistan Uruguay 0.0000153
Antarctica Belgium 0.0000152
Uzbekistan Panama 0.0000152
South Africa Australia 0.0000152
China Peru 0.0000152
Uzbekistan South Korea 0.0000151
Spain Croatia 0.0000150
Uzbekistan Italy 0.0000150
Uzbekistan Taiwan 0.0000150
Argentina Israel 0.0000150
Portugal Finland 0.0000149
Poland Belgium 0.0000145
Czech Republic New Zealand 0.0000143
New Caledonia United Kingdom 0.0000138
Brazil Belgium 0.0000136
Poland Switzerland 0.0000131
Egypt United Kingdom 0.0000131
Portugal Iran 0.0000130
Switzerland Curaçao 0.0000130
Japan Curaçao 0.0000130
Netherlands Finland 0.0000126
United Kingdom South Africa 0.0000123
South Africa Romania 0.0000123
Bangladesh Germany 0.0000120
South Africa Mauritius 0.0000117
India Poland 0.0000117
Germany Finland 0.0000117
India Austria 0.0000116
Peru Unknown 0.0000116
Switzerland Finland 0.0000116
Peru Israel 0.0000114
United States Argentina 0.0000112
Belgium Austria 0.0000111
Greece Canada 0.0000111
Israel Austria 0.0000111
Uzbekistan Ireland 0.0000110
Israel Germany 0.0000110
Greece Switzerland 0.0000109
Cyprus Switzerland 0.0000109
Poland Sweden 0.0000109
Czech Republic Sweden 0.0000108
Costa Rica Germany 0.0000108
India United Kingdom 0.0000107
Israel Poland 0.0000107
Japan Turkey 0.0000105
Australia Curaçao 0.0000102
India France 0.0000102
Brazil Czech Republic 0.0000102
Niue Israel 0.0000101
Ireland France 0.0000101
Ireland Nigeria 0.0000101
Norway Nigeria 0.0000101
South Africa Nigeria 0.0000101
Portugal Sweden 0.0000100
China Argentina 0.0000100
India Spain 0.0000099
Nigeria United States 0.0000097
Luxembourg Netherlands 0.0000097
Austria Austria 0.0000096
Sweden Czech Republic 0.0000095
China China 0.0000095
Ireland Spain 0.0000094
Chile Sweden 0.0000092
India Germany 0.0000091
Spain Lithuania 0.0000090
Egypt Australia 0.0000089
Singapore Norway 0.0000088
Uzbekistan Belgium 0.0000086
Ireland United States 0.0000083
Greece Russia 0.0000082
Israel Netherlands 0.0000082
Austria China 0.0000082
New Zealand Curaçao 0.0000082
Costa Rica Norway 0.0000081
Czech Republic Netherlands 0.0000080
Latvia Netherlands 0.0000080
Spain South Africa 0.0000080
France Croatia 0.0000080
Brazil Russia 0.0000079
New Zealand Russia 0.0000078
India Netherlands 0.0000078
Argentina France 0.0000076
Uzbekistan Switzerland 0.0000076
Antarctica Italy 0.0000076
Luxembourg Austria 0.0000074
Denmark Finland 0.0000074
Norway Sweden 0.0000073
Israel Canada 0.0000073
Slovenia Unknown 0.0000073
Chile France 0.0000072
Austria Brazil 0.0000072
India Canada 0.0000071
Luxembourg China 0.0000070
Portugal Switzerland 0.0000070
Costa Rica Australia 0.0000070
Costa Rica New Zealand 0.0000068
Singapore New Zealand 0.0000068
Canada Croatia 0.0000068
Uruguay Switzerland 0.0000067
Netherlands Greece 0.0000067
Colombia Israel 0.0000066
Luxembourg Canada 0.0000066
Luxembourg Peru 0.0000065
Austria Italy 0.0000065
Taiwan United Kingdom 0.0000065
Brazil Israel 0.0000065
Luxembourg Denmark 0.0000064
China Norway 0.0000064
China Uruguay 0.0000064
China Panama 0.0000064
China South Korea 0.0000064
China Taiwan 0.0000063
Brazil Sweden 0.0000063
Egypt Peru 0.0000063
Czech Republic Austria 0.0000062
Latvia Austria 0.0000062
New Zealand Greece 0.0000062
Luxembourg Brazil 0.0000061
Singapore Denmark 0.0000061
Czech Republic China 0.0000059
Latvia China 0.0000059
South Africa Canada 0.0000058
Austria Norway 0.0000056
Austria New Zealand 0.0000056
Austria Uruguay 0.0000056
Austria Panama 0.0000056
Austria South Korea 0.0000056
Austria Taiwan 0.0000055
Czech Republic Canada 0.0000055
Latvia Canada 0.0000055
Czech Republic Peru 0.0000054
Latvia Peru 0.0000054
Czech Republic Denmark 0.0000054
Latvia Denmark 0.0000054
Antarctica Brazil 0.0000053
Costa Rica United States 0.0000052
Portugal Japan 0.0000052
Chile Spain 0.0000052
Chile Romania 0.0000052
Czech Republic Brazil 0.0000051
Latvia Brazil 0.0000051
New Zealand Fiji 0.0000050
Uzbekistan Czech Republic 0.0000050
Luxembourg Poland 0.0000049
Chile Mauritius 0.0000049
Israel Norway 0.0000048
Luxembourg Norway 0.0000048
Luxembourg New Zealand 0.0000048
Luxembourg New Caledonia 0.0000048
Luxembourg Uruguay 0.0000048
Spain Israel 0.0000048
Luxembourg Panama 0.0000048
Luxembourg South Korea 0.0000048
Spain Curaçao 0.0000048
Luxembourg Italy 0.0000048
Luxembourg Taiwan 0.0000047
Ukraine Sweden 0.0000047
United Kingdom Saudi Arabia 0.0000047
Ukraine France 0.0000047
Taiwan Greece 0.0000046
Netherlands Turkey 0.0000046
Ukraine Spain 0.0000046
Ukraine Romania 0.0000046
Mexico Israel 0.0000045
Unknown Tunisia 0.0000045
Nigeria Australia 0.0000044
Ukraine Mauritius 0.0000043
Netherlands Japan 0.0000043
United States India 0.0000042
South Africa Denmark 0.0000041
Portugal Greece 0.0000041
Czech Republic Poland 0.0000041
Latvia Poland 0.0000041
Nepal Germany 0.0000040
United States Fiji 0.0000040
Latvia Norway 0.0000040
Latvia New Zealand 0.0000040
Czech Republic New Caledonia 0.0000040
Czech Republic Uruguay 0.0000040
Latvia New Caledonia 0.0000040
Latvia Uruguay 0.0000040
Switzerland Japan 0.0000040
Czech Republic Panama 0.0000040
Latvia Panama 0.0000040
Czech Republic South Korea 0.0000040
Latvia South Korea 0.0000040
India Switzerland 0.0000040
Norway Russia 0.0000040
Latvia Italy 0.0000040
Czech Republic Taiwan 0.0000040
Latvia Taiwan 0.0000040
India Czech Republic 0.0000038
Norway Iran 0.0000037
China South Africa 0.0000037
South Africa Austria 0.0000037
Greece Hungary 0.0000036
Kenya Israel 0.0000036
Kenya France 0.0000036
Denmark Argentina 0.0000036
South Africa China 0.0000035
Israel Czech Republic 0.0000035
Saudi Arabia Germany 0.0000035
Nigeria United Kingdom 0.0000035
Luxembourg Ireland 0.0000035
Saudi Arabia Switzerland 0.0000034
New Zealand Hungary 0.0000034
Nepal Belgium 0.0000034
Malaysia Australia 0.0000034
Malaysia New Zealand 0.0000034
Malaysia Unknown 0.0000034
Nigeria Germany 0.0000034
Egypt Netherlands 0.0000032
Germany Bulgaria 0.0000031
Brazil Colombia 0.0000031
South Africa Brazil 0.0000031
Austria Switzerland 0.0000030
Brazil Portugal 0.0000030
Bulgaria Israel 0.0000029
Switzerland South Africa 0.0000029
Canada Finland 0.0000029
Czech Republic Ireland 0.0000029
Brazil Chile 0.0000029
British Virgin Islands Unknown 0.0000029
Argentina Chile 0.0000029
Colombia Chile 0.0000029
Kenya Chile 0.0000029
Spain Chile 0.0000029
Latvia Ireland 0.0000029
Iceland United States 0.0000029
Portugal Denmark 0.0000028
Italy Canada 0.0000028
Belgium Belgium 0.0000027
Israel Belgium 0.0000027
United Kingdom Curaçao 0.0000027
Luxembourg Belgium 0.0000027
Canada Argentina 0.0000027
Poland Bulgaria 0.0000027
Israel Peru 0.0000026
Germany Japan 0.0000026
Uganda Sweden 0.0000025
South Africa Poland 0.0000025
Egypt Denmark 0.0000025
Egypt Austria 0.0000025
British Virgin Islands United Kingdom 0.0000024
South Africa Norway 0.0000024
Luxembourg Switzerland 0.0000024
South Africa New Zealand 0.0000024
South Africa New Caledonia 0.0000024
South Africa Uruguay 0.0000024
South Africa Panama 0.0000024
South Africa South Korea 0.0000024
South Africa Taiwan 0.0000024
Egypt China 0.0000023
Nepal Peru 0.0000023
United Kingdom Bulgaria 0.0000023
Saudi Arabia United States 0.0000023
Costa Rica Belgium 0.0000023
Canada Russia 0.0000023
Czech Republic Belgium 0.0000023
Latvia Belgium 0.0000023
South Africa Iran 0.0000023
Unknown Slovakia 0.0000023
Bulgaria Colombia 0.0000022
Ireland Australia 0.0000022
Egypt Canada 0.0000022
Nepal Netherlands 0.0000021
Austria Czech Republic 0.0000021
Kenya South Africa 0.0000021
China Vietnam 0.0000021
Mexico Sweden 0.0000021
Egypt Brazil 0.0000020
Czech Republic Switzerland 0.0000020
Latvia Switzerland 0.0000020
India Brazil 0.0000020
Argentina Sweden 0.0000020
Kenya Sweden 0.0000019
South Africa Sweden 0.0000018
Slovenia Denmark 0.0000017
Indonesia Germany 0.0000017
Ireland United Kingdom 0.0000017
South Africa Ireland 0.0000017
Indonesia Switzerland 0.0000017
France Finland 0.0000017
Denmark Romania 0.0000017
Nepal Romania 0.0000017
Nepal Spain 0.0000017
Ireland Germany 0.0000017
Egypt Poland 0.0000016
China Mauritius 0.0000016
Denmark Mauritius 0.0000016
Nepal Mauritius 0.0000016
Nigeria Netherlands 0.0000016
Egypt Norway 0.0000016
Egypt New Zealand 0.0000016
Egypt New Caledonia 0.0000016
Egypt Uruguay 0.0000016
Egypt Panama 0.0000016
Egypt South Korea 0.0000016
Egypt Italy 0.0000016
Egypt Taiwan 0.0000016
Luxembourg Czech Republic 0.0000016
Costa Rica Peru 0.0000016
Hungary Germany 0.0000015
Portugal China 0.0000015
Greenland Sweden 0.0000015
Bangladesh Unknown 0.0000015
Costa Rica Netherlands 0.0000014
Brazil South Africa 0.0000014
Portugal Singapore 0.0000014
Nigeria Austria 0.0000014
Canada Singapore 0.0000013
South Korea South Africa 0.0000013
Latvia Czech Republic 0.0000013
Denmark India 0.0000013
Uganda France 0.0000013
Finland Colombia 0.0000013
Colombia Croatia 0.0000013
Saudi Arabia Unknown 0.0000012
South Africa Switzerland 0.0000012
Denmark Bulgaria 0.0000012
Israel China 0.0000012
Nigeria China 0.0000012
Netherlands Singapore 0.0000012
Egypt Ireland 0.0000012
Taiwan Taiwan 0.0000012
Indonesia United States 0.0000011
Israel Spain 0.0000011
Belgium Romania 0.0000011
Brazil Romania 0.0000011
Costa Rica Romania 0.0000011
Costa Rica Spain 0.0000011
Israel Romania 0.0000011
Nigeria Canada 0.0000011
France Turkey 0.0000011
New Caledonia Sweden 0.0000011
Nigeria Peru 0.0000011
Belgium Mauritius 0.0000011
Brazil Mauritius 0.0000011
Costa Rica Mauritius 0.0000011
Israel Mauritius 0.0000011
Switzerland Singapore 0.0000011
Taiwan Poland 0.0000011
Nigeria Denmark 0.0000011
Mexico France 0.0000011
Israel Brazil 0.0000010
Nigeria Brazil 0.0000010
Niue United Kingdom 0.0000010
Switzerland Croatia 0.0000010
Canada Hungary 0.0000010
Finland Portugal 0.0000010
Nepal United Kingdom 0.0000010
Nepal France 0.0000010
Chile Iran 0.0000009
Ireland Denmark 0.0000009
Chile Colombia 0.0000009
Canada Bulgaria 0.0000009
Egypt Belgium 0.0000009
Slovenia Poland 0.0000009
Slovenia Austria 0.0000009
Slovenia United States 0.0000009
Nigeria South Africa 0.0000009
Japan Argentina 0.0000008
Ukraine Iran 0.0000008
Nigeria Poland 0.0000008
Ireland Netherlands 0.0000008
Nigeria Norway 0.0000008
Egypt Switzerland 0.0000008
Israel New Zealand 0.0000008
Nigeria New Zealand 0.0000008
Belgium Uruguay 0.0000008
Israel New Caledonia 0.0000008
Israel Uruguay 0.0000008
Nigeria New Caledonia 0.0000008
Nigeria Uruguay 0.0000008
Belgium Panama 0.0000008
Israel Panama 0.0000008
Nigeria Panama 0.0000008
Israel South Korea 0.0000008
Nigeria South Korea 0.0000008
Israel Italy 0.0000008
Israel Taiwan 0.0000008
Nigeria Taiwan 0.0000008
South Africa Czech Republic 0.0000008
Slovenia France 0.0000008
Germany Croatia 0.0000008
Greenland France 0.0000008
Japan Vietnam 0.0000007
China Finland 0.0000007
Germany Singapore 0.0000007
Australia Colombia 0.0000007
Austria Curaçao 0.0000007
New Zealand Colombia 0.0000007
Costa Rica United Kingdom 0.0000007
Costa Rica France 0.0000006
Taiwan Italy 0.0000006
Indonesia Unknown 0.0000006
Ireland Austria 0.0000006
Spain Bulgaria 0.0000006
Japan Finland 0.0000006
United States Luxembourg 0.0000006
Ireland China 0.0000006
Israel Ireland 0.0000006
Nigeria Ireland 0.0000006
Ireland Canada 0.0000005
Slovenia Canada 0.0000005
Ireland Peru 0.0000005
New Zealand South Africa 0.0000005
Egypt Czech Republic 0.0000005
Ireland Brazil 0.0000005
Norway Colombia 0.0000005
Slovenia Netherlands 0.0000005
Slovenia United Kingdom 0.0000005
Colombia Colombia 0.0000005
Denmark Russia 0.0000005
Spain Russia 0.0000005
Nigeria Belgium 0.0000005
Unknown Sri Lanka 0.0000005
Ukraine Colombia 0.0000004
Ireland Poland 0.0000004
Ireland Norway 0.0000004
Israel Switzerland 0.0000004
Nigeria Switzerland 0.0000004
Ireland New Zealand 0.0000004
Ireland New Caledonia 0.0000004
Ireland Uruguay 0.0000004
Ireland Panama 0.0000004
Ireland South Korea 0.0000004
Ireland Taiwan 0.0000004
Uzbekistan Sweden 0.0000004
Netherlands Argentina 0.0000004
Nepal Canada 0.0000004
Antarctica Israel 0.0000003
China Bulgaria 0.0000003
Mexico South Africa 0.0000003
Spain India 0.0000003
Germany India 0.0000003
China Iran 0.0000003
Denmark Iran 0.0000003
Nepal Iran 0.0000003
United Arab Emirates Italy 0.0000003
United Kingdom Mexico 0.0000003
Slovenia Czech Republic 0.0000003
Japan Bulgaria 0.0000003
Kosovo Ireland 0.0000003
Ireland Ireland 0.0000003
Portugal Poland 0.0000003
Uganda Colombia 0.0000003
Italy Ireland 0.0000003
Nigeria Czech Republic 0.0000003
Bulgaria Turkey 0.0000003
Japan India 0.0000003
Nepal Australia 0.0000003
Switzerland Bulgaria 0.0000003
China Japan 0.0000003
Antarctica Croatia 0.0000003
United Kingdom Croatia 0.0000003
Nepal Sweden 0.0000002
Poland Spain 0.0000002
Denmark Japan 0.0000002
Denmark Portugal 0.0000002
Costa Rica Canada 0.0000002
Ireland Belgium 0.0000002
Mexico Colombia 0.0000002
New Zealand Bulgaria 0.0000002
Peru United States 0.0000002
Argentina Colombia 0.0000002
Belgium Iran 0.0000002
Brazil Iran 0.0000002
Costa Rica Iran 0.0000002
Israel Iran 0.0000002
Kenya Colombia 0.0000002
Unknown Mexico 0.0000002
Denmark Hungary 0.0000002
Spain Hungary 0.0000002
Ireland Switzerland 0.0000002
Australia India 0.0000002
Uzbekistan France 0.0000002
India Sweden 0.0000002
Canada Japan 0.0000002
Canada Portugal 0.0000002
Brazil Saudi Arabia 0.0000002
Belgium South Africa 0.0000002
France South Africa 0.0000002
Japan South Africa 0.0000002
Italy Taiwan 0.0000002
South Korea Austria 0.0000002
Switzerland Colombia 0.0000002
Costa Rica Sweden 0.0000002
Greenland Colombia 0.0000002
Unknown Latvia 0.0000002
Japan Japan 0.0000001
Netherlands Bulgaria 0.0000001
Iceland Netherlands 0.0000001
Finland Bulgaria 0.0000001
Spain Colombia 0.0000001
Ireland Czech Republic 0.0000001
Sweden Bulgaria 0.0000001
United Kingdom Turkey 0.0000001
Luxembourg Sweden 0.0000001
Belgium Bulgaria 0.0000001
United States Tunisia 0.0000001
Portugal Italy 0.0000001
Chile Turkey 0.0000001
United States Mexico 0.0000001
Switzerland Mexico 0.0000001
Latvia Sweden 0.0000001
Germany Turkey 0.0000001
Taiwan Bulgaria 0.0000001
Portugal Hungary 0.0000001
Australia Turkey 0.0000001
Sweden Colombia 0.0000001
New Zealand Turkey 0.0000001
Russia United States 0.0000001
Slovenia Germany 0.0000001
Poland Colombia 0.0000001
Nepal Denmark 0.0000001
China Singapore 0.0000001
Netherlands Hungary 0.0000001
United Kingdom Finland 0.0000001
Bulgaria India 0.0000001
Switzerland Hungary 0.0000001
Luxembourg France 0.0000001
United States Slovakia 0.0000001
Japan Portugal 0.0000001
Norway Turkey 0.0000001
Colombia Turkey 0.0000001
Australia Vietnam 0.0000001
Finland Vietnam 0.0000001
United Kingdom Vietnam 0.0000001
Latvia France 0.0000001
Portugal Luxembourg 0.0000001
Ukraine Turkey 0.0000001
Portugal India 0.0000000
Costa Rica Denmark 0.0000000
South Korea United States 0.0000000
Netherlands Luxembourg 0.0000000
Germany Hungary 0.0000000
Uzbekistan Colombia 0.0000000
Hungary France 0.0000000
Egypt Sweden 0.0000000
Netherlands India 0.0000000
Colombia India 0.0000000
Switzerland Luxembourg 0.0000000
Switzerland India 0.0000000
China Israel 0.0000000
Hungary Italy 0.0000000
Italy Austria 0.0000000
Hungary United Kingdom 0.0000000
Finland Turkey 0.0000000
Uganda Turkey 0.0000000
Germany Luxembourg 0.0000000
Mexico Turkey 0.0000000
Argentina Turkey 0.0000000
Netherlands Portugal 0.0000000
Kenya Turkey 0.0000000
Japan Singapore 0.0000000
Australia Finland 0.0000000
Bulgaria Argentina 0.0000000
Egypt France 0.0000000
Nigeria Sweden 0.0000000
Switzerland Turkey 0.0000000
Greenland Turkey 0.0000000
Greece United Kingdom 0.0000000
Iceland United Kingdom 0.0000000
Singapore United Kingdom 0.0000000
Spain Turkey 0.0000000
Luxembourg Colombia 0.0000000
Brazil Mexico 0.0000000
United States Sri Lanka 0.0000000
Czech Republic Colombia 0.0000000
Latvia Colombia 0.0000000
Nigeria France 0.0000000
Ireland Sweden 0.0000000
Portugal Tunisia 0.0000000
Sweden Turkey 0.0000000
Chile Argentina 0.0000000
Bulgaria Finland 0.0000000
Poland Turkey 0.0000000
Netherlands Tunisia 0.0000000
Switzerland Tunisia 0.0000000
Germany Argentina 0.0000000
France India 0.0000000
United Kingdom Japan 0.0000000
Nepal Poland 0.0000000
Australia Japan 0.0000000
South Africa Colombia 0.0000000
Australia Argentina 0.0000000
Finland Japan 0.0000000
Brazil Turkey 0.0000000
Austria Finland 0.0000000
New Zealand Argentina 0.0000000
Hungary Switzerland 0.0000000
France Japan 0.0000000
France Portugal 0.0000000
Hong Kong United Kingdom 0.0000000
Canada India 0.0000000
Germany Tunisia 0.0000000
Portugal Czech Republic 0.0000000
Portugal Slovakia 0.0000000
Uzbekistan Turkey 0.0000000
United Arab Emirates United Kingdom 0.0000000
Costa Rica Poland 0.0000000
Norway Argentina 0.0000000
Egypt Colombia 0.0000000
Colombia Argentina 0.0000000
Peru Netherlands 0.0000000
Netherlands Slovakia 0.0000000
China Hungary 0.0000000
Ukraine Argentina 0.0000000
United States Latvia 0.0000000
Switzerland Slovakia 0.0000000
Chile Finland 0.0000000
Hong Kong Canada 0.0000000
India Bulgaria 0.0000000
Greece Austria 0.0000000
Chile Bulgaria 0.0000000
Nepal Italy 0.0000000
China India 0.0000000
Finland Argentina 0.0000000
China Luxembourg 0.0000000
Uganda Argentina 0.0000000
Germany Slovakia 0.0000000
New Zealand Finland 0.0000000
Hungary Poland 0.0000000
United Arab Emirates Netherlands 0.0000000
Belgium Colombia 0.0000000
Israel Colombia 0.0000000
Nigeria Colombia 0.0000000
Mexico Argentina 0.0000000
China Turkey 0.0000000
Argentina Argentina 0.0000000
Kenya Argentina 0.0000000
Austria Turkey 0.0000000
Costa Rica Italy 0.0000000
Norway Finland 0.0000000
Israel Bulgaria 0.0000000
Colombia Finland 0.0000000
Norway Bulgaria 0.0000000
Ukraine Finland 0.0000000
Switzerland Argentina 0.0000000
Luxembourg Turkey 0.0000000
Niue United States 0.0000000
Greenland Argentina 0.0000000
Argentina Bulgaria 0.0000000
Japan Hungary 0.0000000
Bulgaria Japan 0.0000000
Bulgaria Portugal 0.0000000
Ukraine Bulgaria 0.0000000
Spain Argentina 0.0000000
Czech Republic Turkey 0.0000000
Latvia Turkey 0.0000000
Ireland Colombia 0.0000000
British Virgin Islands Netherlands 0.0000000
Uganda Finland 0.0000000
United Kingdom India 0.0000000
Portugal Sri Lanka 0.0000000
Antarctica India 0.0000000
Hong Kong Austria 0.0000000
Japan Luxembourg 0.0000000
Uganda Bulgaria 0.0000000
Mexico Finland 0.0000000
Netherlands Sri Lanka 0.0000000
Argentina Finland 0.0000000
Costa Rica Finland 0.0000000
Kenya Finland 0.0000000
South Africa Turkey 0.0000000
Switzerland Sri Lanka 0.0000000
Sweden Argentina 0.0000000
Mexico Bulgaria 0.0000000
Uganda Spain 0.0000000
United Kingdom Portugal 0.0000000
Poland Argentina 0.0000000
Kenya Bulgaria 0.0000000
Slovenia Sweden 0.0000000
Argentina Spain 0.0000000
Greenland Finland 0.0000000
Chile Japan 0.0000000
Chile Lithuania 0.0000000
Chile Portugal 0.0000000
Mexico Spain 0.0000000
Spain Finland 0.0000000
Kenya Spain 0.0000000
Brazil Argentina 0.0000000
Egypt Turkey 0.0000000
Germany Sri Lanka 0.0000000
Germany Portugal 0.0000000
Greenland Bulgaria 0.0000000
China Tunisia 0.0000000
Australia Lithuania 0.0000000
Australia Portugal 0.0000000
New Zealand Japan 0.0000000
New Zealand Portugal 0.0000000
Portugal Russia 0.0000000
Israel Finland 0.0000000
Greenland Spain 0.0000000
Uzbekistan Argentina 0.0000000
Netherlands Russia 0.0000000
Portugal Latvia 0.0000000
Switzerland Russia 0.0000000
Norway Japan 0.0000000
Norway Lithuania 0.0000000
Norway Portugal 0.0000000
Sweden Finland 0.0000000
Colombia Japan 0.0000000
Colombia Portugal 0.0000000
Netherlands Latvia 0.0000000
Ukraine Japan 0.0000000
Ukraine Lithuania 0.0000000
Ukraine Portugal 0.0000000
Hungary Netherlands 0.0000000
Poland Finland 0.0000000
Switzerland Latvia 0.0000000
Belgium Turkey 0.0000000
Israel Turkey 0.0000000
Nigeria Turkey 0.0000000
China Slovakia 0.0000000
Portugal Ireland 0.0000000
Brazil Finland 0.0000000
Austria Bulgaria 0.0000000
Germany Latvia 0.0000000
Brazil Bulgaria 0.0000000
Finland Lithuania 0.0000000
Japan Tunisia 0.0000000
Uganda Japan 0.0000000
Uganda Lithuania 0.0000000
Uganda Portugal 0.0000000
Slovenia Bulgaria 0.0000000
Uzbekistan Finland 0.0000000
Mexico Japan 0.0000000
Mexico Lithuania 0.0000000
Mexico Portugal 0.0000000
Argentina Japan 0.0000000
Argentina Lithuania 0.0000000
Argentina Portugal 0.0000000
Kenya Japan 0.0000000
Kenya Lithuania 0.0000000
Kenya Portugal 0.0000000
Uzbekistan Bulgaria 0.0000000
Ireland Turkey 0.0000000
Luxembourg Argentina 0.0000000
Switzerland Lithuania 0.0000000
Switzerland Portugal 0.0000000
Uzbekistan Spain 0.0000000
Czech Republic Argentina 0.0000000
Latvia Argentina 0.0000000
Australia Hungary 0.0000000
Finland Hungary 0.0000000
United Kingdom Hungary 0.0000000
Greenland Japan 0.0000000
Greenland Lithuania 0.0000000
Greenland Portugal 0.0000000
Spain Japan 0.0000000
Spain Portugal 0.0000000
Japan Slovakia 0.0000000
Portugal Mexico 0.0000000
Netherlands Mexico 0.0000000
Australia Luxembourg 0.0000000
Finland Luxembourg 0.0000000
United Kingdom Luxembourg 0.0000000
South Africa Argentina 0.0000000
Finland India 0.0000000
Hong Kong Switzerland 0.0000000
Luxembourg Finland 0.0000000
Sweden Japan 0.0000000
Sweden Lithuania 0.0000000
Sweden Portugal 0.0000000
China Sri Lanka 0.0000000
Poland Japan 0.0000000
Poland Lithuania 0.0000000
Poland Portugal 0.0000000
Luxembourg Bulgaria 0.0000000
Czech Republic Finland 0.0000000
Egypt Argentina 0.0000000
Latvia Finland 0.0000000
Brazil Japan 0.0000000
Brazil Lithuania 0.0000000
Czech Republic Bulgaria 0.0000000
Latvia Bulgaria 0.0000000
Luxembourg Spain 0.0000000
Czech Republic Spain 0.0000000
Latvia Spain 0.0000000
Uzbekistan Japan 0.0000000
Uzbekistan Lithuania 0.0000000
Uzbekistan Portugal 0.0000000
Portugal Uruguay 0.0000000
South Africa Finland 0.0000000
China Russia 0.0000000
South Africa Bulgaria 0.0000000
Belgium Argentina 0.0000000
Israel Argentina 0.0000000
Nigeria Argentina 0.0000000
Japan Sri Lanka 0.0000000
China Latvia 0.0000000
Egypt Finland 0.0000000
Slovenia Spain 0.0000000
Egypt Bulgaria 0.0000000
Australia Tunisia 0.0000000
Finland Tunisia 0.0000000
United Kingdom Tunisia 0.0000000
China Portugal 0.0000000
Egypt Spain 0.0000000
Ireland Argentina 0.0000000
Austria Japan 0.0000000
Austria Lithuania 0.0000000
Austria Portugal 0.0000000
Luxembourg Japan 0.0000000
Luxembourg Lithuania 0.0000000
Luxembourg Portugal 0.0000000
Belgium Finland 0.0000000
Nigeria Finland 0.0000000
Japan Russia 0.0000000
Czech Republic Japan 0.0000000
Czech Republic Lithuania 0.0000000
Czech Republic Portugal 0.0000000
Latvia Japan 0.0000000
Latvia Lithuania 0.0000000
Latvia Portugal 0.0000000
Nigeria Bulgaria 0.0000000
Australia Slovakia 0.0000000
Finland Slovakia 0.0000000
United Kingdom Slovakia 0.0000000
Japan Latvia 0.0000000
Nigeria Spain 0.0000000
South Africa Japan 0.0000000
South Africa Lithuania 0.0000000
South Africa Portugal 0.0000000
Ireland Finland 0.0000000
China Mexico 0.0000000
Ireland Bulgaria 0.0000000
Egypt Japan 0.0000000
Egypt Lithuania 0.0000000
Egypt Portugal 0.0000000
Belgium Japan 0.0000000
Belgium Lithuania 0.0000000
Belgium Portugal 0.0000000
Israel Japan 0.0000000
Israel Lithuania 0.0000000
Israel Portugal 0.0000000
Nigeria Japan 0.0000000
Nigeria Lithuania 0.0000000
Nigeria Portugal 0.0000000
Japan Mexico 0.0000000
Australia Sri Lanka 0.0000000
Finland Sri Lanka 0.0000000
United Kingdom Sri Lanka 0.0000000
Ireland Japan 0.0000000
Ireland Lithuania 0.0000000
Ireland Portugal 0.0000000
Finland Russia 0.0000000
Australia Latvia 0.0000000
Finland Latvia 0.0000000
United Kingdom Latvia 0.0000000
Australia Mexico 0.0000000
Finland Mexico 0.0000000
France Nigeria 0.0000000
Code
dependency_summary_noUnknown <- dependency_summary %>%
                                    filter(Cited_Country != "Unknown" & Citing_Country != "Unknown")%>%
                                    arrange(desc(Total_Dependency_Fraction))

dependency_summary <- dependency_summary %>%
                                    arrange(desc(Total_Dependency_Fraction))

3.2.2 Sectors

Code
### select dependency information for slugs and packages
cran_github_rdi <- cran_github %>%
                      select(Package, slug, Depends)

### rename columns
colnames(cran_github_rdi) <- c("Citing_Package", "slug", "Dependencies")


### Package_Citation starts as a copy of the Dependencies column; it is split into one row per dependency further below
cran_github_rdi$Package_Citation <- cran_github_rdi$Dependencies


### join commits information for the citing packages
cran_github_RDI <- cran_github_rdi %>%
                      inner_join(user_commits_total, by = "slug")%>%
                        select(Citing_Package, slug, Dependencies, login,
                              sector, total_additions, total_code_for_slug,
                              contribution_fraction_loc, Package_Citation) %>%
                       # Remove rows with NA in Depends
                        filter(!is.na(Package_Citation))

### rename columns on the basis of the citing package
colnames(cran_github_RDI) <- c("Citing_Package", "Citing_Slug", "Dependencies", "Citing_Login",  "Citing_Sector",
                                "Citing_Additions", "Citing_Total_Slug_Additions", "Citing_Package_Fraction" , "Package_Citation")


### unlist the dependencies for joining
cran_github_RDI_network <-  cran_github_RDI %>%
  separate_rows(Package_Citation, sep = ",\\s*") %>%
  filter(Package_Citation != "")


#### prepare commits information for cited packages
user_commits_rdi <- user_commits_total %>%
  mutate(Package_Citation = str_split(slug, "/", simplify = TRUE)[, 2]) %>%
  select(login, sector, total_additions, total_code_for_slug, contribution_fraction_loc, Package_Citation)

colnames(user_commits_rdi) <- c("Cited_Login", "Cited_Sector",
                                "Cited_Additions", "Cited_Total_Slug_Additions", "Cited_Package_Fraction", "Package_Citation")

### join cited package commit information to citing package dataframe
cran_github_rdi_full <- cran_github_RDI_network %>%
  inner_join(user_commits_rdi, by = "Package_Citation")

### create dependency_fraction = citing package fraction multiplied by cited package fraction
cran_github_rdi_grouped <- cran_github_rdi_full %>%
  mutate(Dependency_Fraction = Citing_Package_Fraction * Cited_Package_Fraction)
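As a minimal sketch of how these fractional weights behave, the toy example below repeats the split, join, and multiply steps on made-up data. The packages (pkgA, pkgB, pkgC), the logins, and the object names toy_citing, toy_cited, toy_links, and toy_totals are hypothetical, and the example assumes that the contribution fractions within each package sum to 1. Under that assumption, every package-to-dependency link carries a total weight of 1, so the summed fractions can be read as a count of citations.

```r
library(dplyr)
library(tidyr)

# Hypothetical citing package with two contributors and two dependencies
toy_citing <- tibble(
  Citing_Package          = "pkgA",
  Citing_Login            = c("ann", "bob"),
  Citing_Package_Fraction = c(0.75, 0.25),
  Package_Citation        = "pkgB, pkgC"
)

# Hypothetical contributors to the cited packages
toy_cited <- tibble(
  Package_Citation       = c("pkgB", "pkgB", "pkgC"),
  Cited_Login            = c("cat", "dan", "eve"),
  Cited_Package_Fraction = c(0.5, 0.5, 1)
)

toy_links <- toy_citing %>%
  # one row per (citing contributor, dependency)
  separate_rows(Package_Citation, sep = ",\\s*") %>%
  # every citing contributor is paired with every cited contributor;
  # the `relationship` argument assumes dplyr >= 1.1.0
  inner_join(toy_cited, by = "Package_Citation",
             relationship = "many-to-many") %>%
  mutate(Dependency_Fraction = Citing_Package_Fraction * Cited_Package_Fraction)

# Total weight carried by each package-to-dependency link
toy_totals <- toy_links %>%
  group_by(Citing_Package, Package_Citation) %>%
  summarize(Link_Total = sum(Dependency_Fraction), .groups = "drop")

toy_totals
```

In this toy case both links (pkgA to pkgB and pkgA to pkgC) sum to 1, so the overall total equals the number of links. Whether the real data satisfy the sum-to-1 assumption depends on how contribution_fraction_loc was built upstream.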
Code
# Group by cited sector and citing sector, and sum Dependency_Fraction

### the number of citations made from one sector to another is the sum of the fractional scores for that pair; summed across all pairs, these add up to the total number of citations.

dependency_summary <- cran_github_rdi_grouped %>%
  group_by(Cited_Sector, Citing_Sector) %>%
  summarize(Total_Dependency_Fraction = sum(Dependency_Fraction, na.rm = TRUE))

sum(dependency_summary$Total_Dependency_Fraction)
[1] 589
Code
# Group by cited sector and sum Total_Dependency_Fraction - total number of citations attributed to each sector
citations_by_sector <- dependency_summary %>%
  group_by(Cited_Sector) %>%
  summarize(Fraction_of_Citations = round(sum(Total_Dependency_Fraction, na.rm = TRUE), 4))


sum(citations_by_sector$Fraction_of_Citations)
[1] 589
Code
citations_by_sector$Denominator_RDI <- round(citations_by_sector$Fraction_of_Citations / sum(citations_by_sector$Fraction_of_Citations),4)

# Group by citing sector and sum Total_Dependency_Fraction - total number of citations made by each sector
citings_by_sector <- dependency_summary %>%
  group_by(Citing_Sector) %>%
  summarize(Fraction_of_Citings = round(sum(Total_Dependency_Fraction, na.rm = TRUE), 4))


sum(citings_by_sector$Fraction_of_Citings)
[1] 588.9999
Code
# join citings by sector with dependency_summary

citings_dependency_summary <- citings_by_sector %>%
                                full_join(dependency_summary, by = "Citing_Sector")

citings_dependency_summary$Numerator_RDI <- round(citings_dependency_summary$Total_Dependency_Fraction / citings_dependency_summary$Fraction_of_Citings,4)

## join denominator_RDI

citations_citings_dependency_summary <- citations_by_sector %>%
                                full_join(citings_dependency_summary, by = "Cited_Sector") %>%
                                select(Citing_Sector, Cited_Sector, Numerator_RDI, Denominator_RDI)

citations_citings_dependency_summary$RDI <- round(citations_citings_dependency_summary$Numerator_RDI / citations_citings_dependency_summary$Denominator_RDI,4)
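The RDI arithmetic above can be checked by hand on a small table. The sketch below uses invented sector-to-sector totals (not values from the CRAN data) and hypothetical object names toy_summary, grand_total, and toy_rdi. It follows the same definition as the code above: the numerator is the share of a citing sector's dependencies that go to the cited sector, and the denominator is the cited sector's share of all dependencies. Unlike the code above, it does not round intermediate values.

```r
library(dplyr)

# Hypothetical sector-to-sector dependency totals
toy_summary <- tibble(
  Citing_Sector = c("Academic", "Academic", "Industry", "Industry"),
  Cited_Sector  = c("Academic", "Industry", "Academic", "Industry"),
  Total_Dependency_Fraction = c(6, 2, 1, 1)
)

grand_total <- sum(toy_summary$Total_Dependency_Fraction)

toy_rdi <- toy_summary %>%
  # share of each citing sector's dependencies going to each cited sector
  group_by(Citing_Sector) %>%
  mutate(Numerator_RDI = Total_Dependency_Fraction / sum(Total_Dependency_Fraction)) %>%
  # each cited sector's share of all dependencies
  group_by(Cited_Sector) %>%
  mutate(Denominator_RDI = sum(Total_Dependency_Fraction) / grand_total) %>%
  ungroup() %>%
  mutate(RDI = Numerator_RDI / Denominator_RDI)

toy_rdi
```

Here the Academic-to-Academic pair has a numerator of 6/8 = 0.75 and a denominator of 7/10 = 0.7, giving an RDI of about 1.07. A value above 1 means the citing sector depends on the cited sector more than that sector's overall share of dependencies would suggest, and a value below 1 means less.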

3.2.3 Double-sided weight graph

Code
# Calculate the total of Fraction_of_Citations, including "Unknown"
total_citations_incl_unknown <- sum(citations_by_country$Fraction_of_Citations)

# Create and round the percentage column to the nearest hundredth, including "Unknown" in the percentage calculation
citations_by_country$Percentage_of_Citations <- round(
  (citations_by_country$Fraction_of_Citations / total_citations_incl_unknown) * 100, 2
)

# Arrange by descending order of the new percentage column
 citations_by_country %>%
  arrange(desc(Percentage_of_Citations))
# A tibble: 70 × 4
   Cited_Country Fraction_of_Citations Denominator_RDI Percentage_of_Citations
   <chr>                         <dbl>           <dbl>                   <dbl>
 1 Unknown                      200.            0.340                    34.0 
 2 United States                172.            0.292                    29.2 
 3 Germany                       41.3           0.0702                    7.02
 4 France                        30.6           0.0519                    5.19
 5 Denmark                       23.1           0.0393                    3.93
 6 Canada                        18.6           0.0316                    3.16
 7 Norway                        14.9           0.0252                    2.52
 8 Netherlands                   12.4           0.0211                    2.11
 9 Bulgaria                      11.2           0.019                     1.9 
10 Australia                      9.99          0.017                     1.7 
# ℹ 60 more rows
Code
# Calculate the total of Fraction_of_Citations, excluding "Unknown"
total_citations <- sum(citations_by_country$Fraction_of_Citations[citations_by_country$Cited_Country != "Unknown"])

# Create and round the percentage column to the nearest hundredth, excluding "Unknown" in the percentage calculation
citations_by_country$Percentage_of_Citations <- ifelse(
  citations_by_country$Cited_Country == "Unknown", 
  NA, 
  round((citations_by_country$Fraction_of_Citations / total_citations) * 100, 2)
)



citations_by_country %>%
  arrange(desc(Percentage_of_Citations))
# A tibble: 70 × 4
   Cited_Country  Fraction_of_Citations Denominator_RDI Percentage_of_Citations
   <chr>                          <dbl>           <dbl>                   <dbl>
 1 United States                 172.            0.292                    44.2 
 2 Germany                        41.3           0.0702                   10.6 
 3 France                         30.6           0.0519                    7.87
 4 Denmark                        23.1           0.0393                    5.95
 5 Canada                         18.6           0.0316                    4.79
 6 Norway                         14.9           0.0252                    3.83
 7 Netherlands                    12.4           0.0211                    3.19
 8 Bulgaria                       11.2           0.019                     2.88
 9 Australia                       9.99          0.017                     2.57
10 United Kingdom                  9.65          0.0164                    2.48
# ℹ 60 more rows

The following graph compares each country's lines-of-code credit with the percentage of citations it receives from other countries (reverse dependencies, in package language).

Code
data <- data.frame(
  Country = c("United States", "Germany", "United Kingdom", "France", "Canada", 
              "Australia", "Netherlands", "Switzerland", "Spain", "China",
              "United States", "Germany", "United Kingdom", "France", "Canada", 
              "Australia", "Netherlands", "Switzerland", "Spain", "China"),
  Measure = c("Package %", "Package %", "Package %", "Package %", "Package %", 
              "Package %", "Package %", "Package %", "Package %", "Package %",
              "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %", 
              "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %", "Reverse Dependency %"),
  Value = c( -30.9, -10.6, -7.6, -5.9, -5.3,  -4.7, -3.6, -2.7, -2.6, -2.2, 
             44.2, 10.6, 2.48, 7.9, 4.8, 2.6, 3.2, 1, 1.4, .3) # Package % values are negative and Reverse Dependency % values positive so the bars diverge from zero
)

# Filter data to include only Package % values for ordering
code_values <- data %>% 
  filter(Measure == "Package %") %>% 
  arrange(desc(Value))

# Reorder Country factor based on Package % values
data$Country <- factor(data$Country, levels = code_values$Country)

# Create the plot with value labels, ordered countries, and increased text size
plot <- ggplot(data, aes(x = Country, y = Value, fill = Measure)) +
  geom_bar(stat = "identity", position = "identity") +
  coord_flip() +
  scale_y_continuous(labels = abs, breaks = seq(-50, 50, by = 10), limits = c(-50, 55)) +
  labs(y = "Percentage", x = "", title = "R") +
  theme_minimal() +
  scale_fill_manual(values = c("Package %" = "darkblue", "Reverse Dependency %" = "lightblue")) + # Add your own colors
  theme(
    text = element_text(size = 14), # Changes global text size
    axis.title = element_text(size = 16), # Changes axis title text size
    axis.text = element_text(size = 12), # Changes axis text size
    plot.title = element_text(size = 12, face = "bold", hjust = .5) # Changes plot title text size and makes it bold
  )+
  geom_text(data = subset(data, Value > 0), aes(label = sprintf("%0.1f%%", Value)), 
            position = position_nudge(y = 0.5), hjust = 0, size = 3.5) +
  geom_text(data = subset(data, Value < 0), aes(label = sprintf("%0.1f%%", abs(Value))), 
            position = position_nudge(y = -0.5), hjust = 1, size = 3.5)+
  theme(legend.position = "bottom")

# Display the plot
print(plot)

3.3 Measuring Impact: OSS Developers and Projects

  • Who are the key players (projects, developers, institutions, sectors, and countries) on the networks and how has this changed over time?

  • How do the positions of OSS actors impact OSS contributions?

3.3.1 Distributions of impact measures by top actors

3.3.1.1 What is the distribution of impact measures among sectors?

We look at the distributions of several impact measures by sector to see whether certain sectors have higher-impact packages.

3.3.1.1.1 All-Time Downloads

The business sector appears to have the packages with the highest all-time downloads on average. Downloads are plotted on a natural-log scale for readability.

Code
## Show distribution of all-time downloads by sector
 ggplot(cran_repos, aes(x = Sector, y = log(Downloads_All_Time), fill = Sector))+
  geom_boxplot()+
  ggtitle("All-Time Downloads Distribution by Sector (GitHub R Packages)")+
  ylab("Log of All-Time Downloads")+
   theme_gdocs()+
    theme(plot.title = element_text(size = 13))+
   coord_flip()+
  xlab("")+
   scale_fill_westat(option = "BLUES", drop = FALSE)

3.3.1.1.2 Normalized Downloads

The same holds for normalized downloads. These are probably packages from RStudio.

Code
## Show distribution of normalized downloads by sector
 ggplot(cran_repos, aes(x = Sector, y = log(Downloads_Normalized), fill = Sector))+
  geom_boxplot()+
  ggtitle("Normalized Downloads Distribution by Sector (GitHub R Packages)")+
  ylab("Log of Normalized Downloads")+
   theme_gdocs()+
    theme(plot.title = element_text(size = 13))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)+
   xlab("")+
   labs(caption = "*Data points represent individual packages")

3.3.1.1.3 Reverse Dependencies

For reverse dependencies, most sectors sit at about zero on the log scale, aside from government and business, which are at about 1. Because this is the natural log, a value of 0 corresponds to 1 reverse dependency and a value of 1 to about 2.7 reverse dependencies. Packages with zero reverse dependencies have a log of -Inf and do not appear in the plot. Many observations in the Unknown sector sit at the higher end.
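
Since `log()` in R is the natural logarithm, values read off these axes convert back with `exp()`, and a count of zero maps to `-Inf`, which `ggplot2` removes as a non-finite value. A small sketch with made-up counts (the `counts` vector is hypothetical); `log1p()` is shown only as one possible alternative, not something the analysis here uses:

```r
counts <- c(0, 1, 3, 10, 150)   # hypothetical reverse-dependency counts

log_counts <- log(counts)        # natural log; log(0) is -Inf
finite_only <- log_counts[is.finite(log_counts)]

# Back-transform a value read off the log axis
back_from_one <- exp(1)          # about 2.72

# log1p(x) = log(1 + x) keeps zero counts at 0 instead of -Inf
log1p_counts <- log1p(counts)

print(log_counts)
print(back_from_one)
print(log1p_counts)
```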

Code
## Show distribution of reverse dependencies by sector
 ggplot(cran_repos, aes(x = Sector, y = log(Reverse_Depends_Count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Reverse Dependencies Distribution by Sector (GitHub R Packages)")+
  ylab("Log of Reverse Dependencies")+
   theme_gdocs()+
    theme(plot.title = element_text(size = 13))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)+
   xlab("")

3.3.1.1.4 Stars

Business has the highest log of stars on average, followed by government. We know from the EDA that stars and downloads have a moderate correlation so this makes sense.

Code
## Show distribution of star counts by sector
 ggplot(cran_repos, aes(x = Sector, y = log(stargazer_count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Star Count Distribution by Sector (GitHub R Packages)")+
  ylab("Log of Star Count")+
   theme_gdocs()+
    theme(plot.title = element_text(size = 13))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)+
   xlab("")+
   labs(caption = "*Data points represent individual packages")

3.3.1.1.5 Forks

Business has the highest log of forks on average as well, followed by government again. We know from the EDA that stars and forks have a very high correlation so this makes sense too.

Code
## Show distribution of fork counts by sector
 ggplot(cran_repos, aes(x = Sector, y = log(fork_count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Fork Count Distribution by Sector (GitHub R Packages)")+
  ylab("Log of Fork Count")+
   theme_gdocs()+
    theme(plot.title = element_text(size = 13))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)+
   xlab("")+
   labs(caption = "*Data points represent individual packages")

3.3.1.2 What is the distribution of impact measures among institutions/organizations?

For the top institutions/organizations we found on GitHub, we look at the distribution of the same impact measures across their packages.

Code
### Filter for repos belonging to the top 10 institutions
Top_Institution_repos <- cran_repos %>%
                  filter(Institution %in% top10_Institutions_GitHub$Institution)%>%
                  filter(!is.na(Downloads_All_Time))

3.3.1.2.1 All-Time Downloads

RStudio has the highest log of all-time downloads on average of the top 10 institutions on GitHub.

Code
 ggplot(Top_Institution_repos, aes(x = Institution, y = log(Downloads_All_Time), fill = Sector))+
  geom_boxplot()+
  ggtitle("All-Time Downloads Distribution by Institution - Top 10")+
  ylab("Log of All-Time Downloads")+
   ylim(0,20)+
   theme_gdocs()+
    theme(plot.title = element_text(size = 15))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)

3.3.1.2.2 Normalized Downloads

The distributions are very similar for normalized downloads as well.

Code
 ggplot(Top_Institution_repos, aes(x = Institution, y = log(Downloads_Normalized), fill = Sector))+
  geom_boxplot()+
  ggtitle("Normalized Downloads Distribution by Institution - Top 10")+
  ylab("Log of Normalized Downloads")+
   ylim(0,20)+
   theme_gdocs()+
    theme(plot.title = element_text(size = 15))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)

3.3.1.2.3 Reverse Dependencies

For reverse dependencies, the log averages are essentially all zero. There are a few that are above this mark, with RStudio having the most observations at the higher end.

Code
 ggplot(Top_Institution_repos, aes(x = Institution, y = log(Reverse_Depends_Count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Reverse Dependencies Distribution by Institution - Top 10")+
  ylab("Log of Reverse Dependencies")+
   ylim(0,20)+
   theme_gdocs()+
    theme(plot.title = element_text(size = 15))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)

3.3.1.2.4 Stars

Star count is led by RStudio as well; UCLA appears to be the next highest on average.

Code
ggplot(Top_Institution_repos, aes(x = Institution, y = log(stargazer_count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Star Count Distribution by Institution - Top 10")+
  ylab("Log of Star Count")+
   ylim(0,20)+
   theme_gdocs()+
    theme(plot.title = element_text(size = 15))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)

3.3.1.2.5 Forks

What holds for stars mostly holds for forks as well.

Code
ggplot(Top_Institution_repos, aes(x = Institution, y = log(fork_count), fill = Sector))+
  geom_boxplot()+
  ggtitle("Fork Count Distribution by Institution - Top 10")+
  ylab("Log of Fork Count")+
   ylim(0,20)+
   theme_gdocs()+
    theme(plot.title = element_text(size = 15))+
   coord_flip()+
   scale_fill_westat(option = "BLUES", drop = FALSE)

The following code is analysis for a working paper. The results are not yet commented on; the goal is to see whether larger teams have more impact on average.

3.4 How does team size relate to impact?

To do: test removing extreme team-size outliers, add reverse dependencies, and compare normalized and non-normalized measures.

Open question: the literature relates team size to citations. Do we see the same pattern here?

Code
## Create team size
user_commits_total <- user_commits_total %>%
  group_by(slug) %>%
  mutate(team_size = n()) %>%
  ungroup()

## normalize stars and forks based on year_created 

user_commits_total <- user_commits_total %>%
  mutate(year_created = as.numeric(year_created))

user_commits_total <- user_commits_total %>%
  mutate(
    normalization_factor = ifelse(is.na(year_created), NA, 2023 - year_created + 1),
    stargazer_count_normalized = ifelse(is.na(stargazer_count) | is.na(normalization_factor), stargazer_count, stargazer_count / normalization_factor),
    fork_count_normalized = ifelse(is.na(fork_count) | is.na(normalization_factor), fork_count, fork_count / normalization_factor),
    reverse_dep_normalized = ifelse(is.na(Reverse_Depends_Count) | is.na(normalization_factor), Reverse_Depends_Count, Reverse_Depends_Count / normalization_factor)
  ) 
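
The normalization above divides each count by the number of calendar years a repository has existed, counting both the creation year and 2023. A minimal sketch on a made-up data frame (the `toy_repos` rows are hypothetical):

```r
toy_repos <- data.frame(
  slug = c("a/old", "b/new", "c/unknown"),
  year_created = c(2014, 2023, NA),
  stargazer_count = c(500, 40, 12),
  stringsAsFactors = FALSE
)

# Years of existence, inclusive of the creation year and 2023
toy_repos$normalization_factor <- 2023 - toy_repos$year_created + 1

# Stars per year; as in the code above, rows with no creation year keep the raw count
toy_repos$stargazer_count_normalized <- ifelse(
  is.na(toy_repos$normalization_factor),
  toy_repos$stargazer_count,
  toy_repos$stargazer_count / toy_repos$normalization_factor
)

print(toy_repos)
```

One consequence worth keeping in mind: rows without a creation year stay on the raw scale, so they are not directly comparable with the per-year values.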

3.4.1 Bin teamsize

3.4.1.1 With outliers

Code
user_commits_distinct <- user_commits_total %>%
  distinct(slug, .keep_all = TRUE)


quantile(user_commits_distinct$team_size, probs = seq(0, 1, .1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   1    1    1    1    2    2    3    3    5    8  880 
Code
# Define the bins based on the percentiles above
bins <- c(1, 2, 3, 5, 8, 880)

# Create labels for the bins
bin_labels <- c("[1]", "[2]", "[3-4]", "[5-7]", "[8-880]")

# Create a new column with binned team sizes and custom labels
user_commits_distinct <- user_commits_distinct %>%
  mutate(team_size_bin = cut(team_size, breaks = bins, labels = bin_labels, include.lowest = TRUE, right = FALSE))

## Count packages in each team-size bin
table(user_commits_distinct$team_size_bin)

    [1]     [2]   [3-4]   [5-7] [8-880] 
   2441    1804    1588     791     783 
Code
mean(user_commits_distinct$team_size)
[1] 4.619009
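
With `right = FALSE`, `cut()` builds left-closed intervals, so the breaks `c(1, 2, 3, 5, 8, 880)` correspond to [1, 2), [2, 3), [3, 5), [5, 8) and, because of `include.lowest = TRUE`, a closed final interval [8, 880]. A quick check on a few hypothetical team sizes placed at the bin edges:

```r
team_size <- c(1, 2, 3, 4, 5, 7, 8, 880)   # hypothetical values at the bin edges

bins <- c(1, 2, 3, 5, 8, 880)
bin_labels <- c("[1]", "[2]", "[3-4]", "[5-7]", "[8-880]")

team_size_bin <- cut(team_size, breaks = bins, labels = bin_labels,
                     include.lowest = TRUE, right = FALSE)

print(data.frame(team_size, team_size_bin))
```

A team of 8 lands in the last bin rather than in "[5-7]", and 880 is kept only because the final interval is closed on the right.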

3.4.1.2 No outliers (IQR)

Code
# Calculate Q1, Q3, and IQR
Q1 <- quantile(user_commits_distinct$team_size, 0.25)
Q3 <- quantile(user_commits_distinct$team_size, 0.75)
IQR <- Q3 - Q1

# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter out outliers
user_commits_no_outliers1 <- user_commits_distinct %>%
  filter(team_size >= lower_bound & team_size <= upper_bound)

quantile(user_commits_no_outliers1$team_size, probs = seq(0, 1, .1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   1    1    1    1    2    2    2    3    4    5    8 
Code
# Define the bins based on the updated percentiles
bins_no_outliers1 <- c(1, 2, 3, 5, 8)

# Create labels for the bins
bin_labels_no_outliers1 <- c("[1]", "[2]", "[3-4]", "[5-8]")


# Create a new column with binned team sizes and custom labels
user_commits_no_outliers1 <- user_commits_no_outliers1 %>%
  mutate(team_size_bin = cut(team_size, breaks = bins_no_outliers1, labels = bin_labels_no_outliers1, include.lowest = TRUE, right = FALSE))

# View the table of binned team sizes
table(user_commits_no_outliers1$team_size_bin)

  [1]   [2] [3-4] [5-8] 
 2441  1804  1588   909 
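
The IQR rule keeps values within 1.5 interquartile ranges of the quartiles. A small sketch on made-up team sizes (the `sizes` vector is hypothetical) shows how a single very large team falls outside the upper fence while moderately large teams stay in:

```r
sizes <- c(1, 1, 2, 2, 3, 3, 4, 5, 8, 880)   # hypothetical team sizes

q1 <- quantile(sizes, 0.25, names = FALSE)
q3 <- quantile(sizes, 0.75, names = FALSE)
iqr_sizes <- q3 - q1

# Fences at 1.5 interquartile ranges beyond the quartiles
lower_bound <- q1 - 1.5 * iqr_sizes
upper_bound <- q3 + 1.5 * iqr_sizes

kept <- sizes[sizes >= lower_bound & sizes <= upper_bound]

print(c(lower = lower_bound, upper = upper_bound))
print(kept)
```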

3.4.1.3 No outliers (z-score)

Code
# Calculate the mean and standard deviation of team_size
mean_team_size <- mean(user_commits_distinct$team_size)
sd_team_size <- sd(user_commits_distinct$team_size)

# Calculate Z-scores
user_commits_distinct <- user_commits_distinct %>%
  mutate(z_score = (team_size - mean_team_size) / sd_team_size)

# Define a threshold for Z-scores (commonly 3 or 2.5)
z_threshold <- 3

# Filter out outliers based on Z-score
user_commits_no_outliers_z <- user_commits_distinct %>%
  filter(abs(z_score) <= z_threshold)

quantile(user_commits_no_outliers_z$team_size, probs = seq(0, 1, .1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 1.0  1.0  1.0  1.0  2.0  2.0  3.0  3.0  5.0  7.1 58.0 
Code
# Define the bins based on the updated percentiles
bins_no_outliers_z <- c(1, 2, 3, 5, 8, 58)

# Create labels for the bins (with right = FALSE these breaks give [5, 8) and [8, 58])
bin_labels_no_outliers_z <- c("[1]", "[2]", "[3-4]", "[5-7]", "[8-58]")

# Create a new column with binned team sizes and custom labels
user_commits_no_outliers_z <- user_commits_no_outliers_z %>%
  mutate(team_size_bin = cut(team_size, breaks = bins_no_outliers_z , labels = bin_labels_no_outliers_z, include.lowest = TRUE, right = FALSE))

# View the table of binned team sizes
table(user_commits_no_outliers_z$team_size_bin)

   [1]    [2]  [3-4]  [5-7] [8-58] 
  2441   1804   1588    791    736 
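
The z-score rule tends to be more permissive than the IQR rule on heavy-tailed data, because the extreme values themselves inflate the mean and standard deviation. That is consistent with the maximum team size of 58 retained here versus 8 under the IQR rule. A sketch on made-up team sizes (the `sizes` vector is hypothetical):

```r
# Hypothetical team sizes: many small teams, one mid-sized and one huge project
sizes <- c(rep(1:5, times = 20), 58, 880)

z_score <- (sizes - mean(sizes)) / sd(sizes)
z_threshold <- 3

# Keep values whose absolute z-score does not exceed the threshold
kept_z <- sizes[abs(z_score) <= z_threshold]

print(range(kept_z))
print(length(sizes) - length(kept_z))  # number of values removed
```

Only the team of 880 is removed in this toy case; the team of 58 stays because the standard deviation is dominated by the single extreme value.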

3.4.2 boxplot viz (With Outliers)

3.4.2.1 Normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count_normalized vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(stargazer_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count_normalized vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(fork_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_Normalized vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(Downloads_Normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of reverse_dep_normalized vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(reverse_dep_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep Normalized", fill = "Team Size") +
  custom_theme

3.4.2.2 non-normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(stargazer_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(fork_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_All_Time vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(Downloads_All_Time), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads", fill = "Team Size") +
  custom_theme

Code
# Plot log of Reverse_Depends_Count vs. team_size_bin
ggplot(user_commits_distinct, aes(x = team_size_bin, y = log(Reverse_Depends_Count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep", fill = "Team Size") +
  custom_theme

3.4.3 boxplot viz (no outliers IQR)

3.4.3.1 Normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count_normalized vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(stargazer_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count_normalized vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(fork_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_Normalized vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(Downloads_Normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of reverse_dep_normalized vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(reverse_dep_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep Normalized", fill = "Team Size") +
  custom_theme

3.4.3.2 non-normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(stargazer_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(fork_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_All_Time vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(Downloads_All_Time), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads", fill = "Team Size") +
  custom_theme

Code
# Plot log of Reverse_Depends_Count vs. team_size_bin
ggplot(user_commits_no_outliers1, aes(x = team_size_bin, y = log(Reverse_Depends_Count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep", fill = "Team Size") +
  custom_theme

3.4.4 boxplot viz (no outliers z-score)

3.4.4.1 Normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count_normalized vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(stargazer_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count_normalized vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(fork_count_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_Normalized vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(Downloads_Normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads Normalized", fill = "Team Size") +
  custom_theme

Code
# Plot log of reverse_dep_normalized vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(reverse_dep_normalized), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep Normalized", fill = "Team Size") +
  custom_theme

3.4.4.2 non-normalized

Code
# Define a custom theme
custom_theme <- theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

# Plot log of stargazer_count vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(stargazer_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_westat(option = "BLUES", drop = FALSE) +
  labs(x = "Team Size Bin", y = "Log of Stargazer Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of fork_count vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(fork_count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Team Size Bin", y = "Log of Fork Count", fill = "Team Size") +
  custom_theme

Code
# Plot log of Downloads_All_Time vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(Downloads_All_Time), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Downloads", fill = "Team Size") +
  custom_theme

Code
# Plot log of Reverse_Depends_Count vs. team_size_bin
ggplot(user_commits_no_outliers_z, aes(x = team_size_bin, y = log(Reverse_Depends_Count), fill = team_size_bin)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 21, outlier.size = 3) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Team Size Bin", y = "Log of Rev Dep", fill = "Team Size") +
  custom_theme

3.4.5 correlation viz (with outliers)

3.4.5.1 normalized

Code
# Calculate correlation between team_size and the normalized counts
correlation_matrix <- user_commits_distinct %>%
  select(team_size, stargazer_count_normalized, fork_count_normalized, Downloads_Normalized, reverse_dep_normalized) %>%
  cor(use = "complete.obs")

# View the correlation matrix
print(correlation_matrix)
                           team_size stargazer_count_normalized
team_size                  1.0000000                 0.75543882
stargazer_count_normalized 0.7554388                 1.00000000
fork_count_normalized      0.7782930                 0.92576329
Downloads_Normalized       0.2759388                 0.10871598
reverse_dep_normalized     0.1787497                 0.07614058
                           fork_count_normalized Downloads_Normalized
team_size                             0.77829300            0.2759388
stargazer_count_normalized            0.92576329            0.1087160
fork_count_normalized                 1.00000000            0.1200080
Downloads_Normalized                  0.12000801            1.0000000
reverse_dep_normalized                0.09906453            0.2785460
                           reverse_dep_normalized
team_size                              0.17874974
stargazer_count_normalized             0.07614058
fork_count_normalized                  0.09906453
Downloads_Normalized                   0.27854596
reverse_dep_normalized                 1.00000000
Code
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(correlation_matrix)

ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "darkblue", mid = "lightblue", midpoint = 0, limit = c(-1, 1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap", x = "", y = "")

3.4.5.2 non-normalized

Code
# Calculate correlation between team_size and the non-normalized counts
correlation_matrix <- user_commits_distinct %>%
  select(team_size, stargazer_count, fork_count, Downloads_All_Time, Reverse_Depends_Count) %>%
  cor(use = "complete.obs")

# View the correlation matrix
print(correlation_matrix)
                      team_size stargazer_count fork_count Downloads_All_Time
team_size             1.0000000       0.7912326  0.7598055          0.3070778
stargazer_count       0.7912326       1.0000000  0.9419632          0.1674230
fork_count            0.7598055       0.9419632  1.0000000          0.1852880
Downloads_All_Time    0.3070778       0.1674230  0.1852880          1.0000000
Reverse_Depends_Count 0.2372756       0.1674554  0.1951223          0.4449235
                      Reverse_Depends_Count
team_size                         0.2372756
stargazer_count                   0.1674554
fork_count                        0.1951223
Downloads_All_Time                0.4449235
Reverse_Depends_Count             1.0000000
Code
# Melt the correlation matrix
melted_correlation_matrix <- melt(correlation_matrix)

ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "darkblue", mid = "lightblue", midpoint = 0, limit = c(-1, 1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap", x = "", y = "")
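
Stars, forks, and downloads are heavily right-skewed, so the Pearson correlations above can be driven by a few very large packages. The team_size and stargazer_count correlation, for example, is 0.79 here but 0.57 once outliers are removed in section 3.4.6.2. A rank-based (Spearman) correlation is one common robustness check. The sketch below is not part of the original analysis: it uses simulated data with made-up values and only borrows the column names, purely to show the call.

```r
# Simulated stand-in for user_commits_distinct (values are invented for illustration)
set.seed(1)
n <- 200
team_size <- rpois(n, 3) + 1
demo <- data.frame(
  team_size = team_size,
  stargazer_count = round(team_size * rlnorm(n, meanlog = 2, sdlog = 1.5)),
  fork_count = round(team_size * rlnorm(n, meanlog = 1, sdlog = 1.5))
)

# Pearson (as used in the analysis) versus Spearman (rank-based, less outlier-sensitive)
pearson_matrix  <- cor(demo, use = "complete.obs", method = "pearson")
spearman_matrix <- cor(demo, use = "complete.obs", method = "spearman")

print(round(pearson_matrix, 2))
print(round(spearman_matrix, 2))
```

If the two matrices disagree strongly on the real data, that would suggest the Pearson values are dominated by extreme packages.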

3.4.6 correlation viz (no outliers z-score)

3.4.6.1 normalized

Code
# Calculate correlation between team_size and the normalized counts
correlation_matrix <- user_commits_no_outliers_z %>%
  select(team_size, stargazer_count_normalized, fork_count_normalized, Downloads_Normalized, reverse_dep_normalized) %>%
  cor(use = "complete.obs")

# View the correlation matrix
print(correlation_matrix)
                           team_size stargazer_count_normalized
team_size                  1.0000000                 0.46840694
stargazer_count_normalized 0.4684069                 1.00000000
fork_count_normalized      0.5738767                 0.84916902
Downloads_Normalized       0.3410986                 0.13893791
reverse_dep_normalized     0.1162250                 0.04734392
                           fork_count_normalized Downloads_Normalized
team_size                              0.5738767            0.3410986
stargazer_count_normalized             0.8491690            0.1389379
fork_count_normalized                  1.0000000            0.1713354
Downloads_Normalized                   0.1713354            1.0000000
reverse_dep_normalized                 0.0756769            0.1463907
                           reverse_dep_normalized
team_size                              0.11622504
stargazer_count_normalized             0.04734392
fork_count_normalized                  0.07567690
Downloads_Normalized                   0.14639068
reverse_dep_normalized                 1.00000000
Code
# Melt the correlation matrix
melted_correlation_matrix <- melt(correlation_matrix)

ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "darkblue", mid = "lightblue", midpoint = 0, limit = c(-1, 1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap", x = "", y = "")

3.4.6.2 non-normalized

Code
# Calculate correlation between team_size and the raw (non-normalized) counts
correlation_matrix <- user_commits_no_outliers_z %>%
  select(team_size, stargazer_count, fork_count, Downloads_All_Time, Reverse_Depends_Count) %>%
  cor(use = "complete.obs")

# View the correlation matrix
print(correlation_matrix)
                      team_size stargazer_count fork_count Downloads_All_Time
team_size             1.0000000       0.5699283  0.6312434          0.3692510
stargazer_count       0.5699283       1.0000000  0.8223352          0.2075291
fork_count            0.6312434       0.8223352  1.0000000          0.2449997
Downloads_All_Time    0.3692510       0.2075291  0.2449997          1.0000000
Reverse_Depends_Count 0.1569053       0.1003055  0.1365102          0.2489343
                      Reverse_Depends_Count
team_size                         0.1569053
stargazer_count                   0.1003055
fork_count                        0.1365102
Downloads_All_Time                0.2489343
Reverse_Depends_Count             1.0000000
Code
# Melt the correlation matrix
melted_correlation_matrix <- melt(correlation_matrix)

ggplot(data = melted_correlation_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "blue", high = "darkblue", mid = "lightblue", midpoint = 0, limit = c(-1, 1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix Heatmap", x = "", y = "")

3.4.7 models (continuous team_size with outliers)

3.4.7.1 normalized

Code
library(broom)

# Filter out non-positive values and missing values
user_commits_filtered1 <- user_commits_distinct %>%
  filter(stargazer_count_normalized > 0, !is.na(stargazer_count_normalized),
         fork_count_normalized > 0, !is.na(fork_count_normalized),
         Downloads_Normalized > 0, !is.na(Downloads_Normalized),
         reverse_dep_normalized > 0, !is.na(reverse_dep_normalized))

# Perform linear regression and get summary and confidence intervals
model_stargazer <- lm(stargazer_count_normalized ~ team_size, data = user_commits_filtered1)
model_fork <- lm(fork_count_normalized ~ team_size, data = user_commits_filtered1)
model_downloads <- lm(log(Downloads_Normalized) ~ team_size, data = user_commits_filtered1)
model_revdep <- lm(log(reverse_dep_normalized) ~ team_size, data = user_commits_filtered1)

summary(model_stargazer)

Call:
lm(formula = stargazer_count_normalized ~ team_size, data = user_commits_filtered1)

Residuals:
    Min      1Q  Median      3Q     Max 
-693.28   -3.34    3.33    7.63 1143.15 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -8.6722     3.5147  -2.467   0.0139 *  
team_size     2.3348     0.0823  28.368   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 73.48 on 503 degrees of freedom
Multiple R-squared:  0.6154,    Adjusted R-squared:  0.6146 
F-statistic: 804.8 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_stargazer)
                 2.5 %    97.5 %
(Intercept) -15.577516 -1.766800
team_size     2.173114  2.496518
Code
summary(model_fork)

Call:
lm(formula = fork_count_normalized ~ team_size, data = user_commits_filtered1)

Residuals:
    Min      1Q  Median      3Q     Max 
-248.51   -0.83    2.98    4.34  411.02 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.16034    1.22170  -4.224 2.85e-05 ***
team_size    0.82062    0.02861  28.685  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.54 on 503 degrees of freedom
Multiple R-squared:  0.6206,    Adjusted R-squared:  0.6199 
F-statistic: 822.8 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_fork)
                 2.5 %     97.5 %
(Intercept) -7.5605956 -2.7600782
team_size    0.7644126  0.8768258
Code
summary(model_downloads)

Call:
lm(formula = log(Downloads_Normalized) ~ team_size, data = user_commits_filtered1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4543 -1.7305 -0.1898  1.3366  5.3158 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.767840   0.097607 110.319   <2e-16 ***
team_size    0.021243   0.002286   9.294   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.041 on 503 degrees of freedom
Multiple R-squared:  0.1466,    Adjusted R-squared:  0.1449 
F-statistic: 86.38 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_downloads)
                  2.5 %      97.5 %
(Intercept) 10.57607342 10.95960736
team_size    0.01675265  0.02573381
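
Because the downloads model uses a log-transformed outcome, the `team_size` coefficient is easier to read as a percentage change: each additional team member multiplies the expected normalized downloads by exp(coefficient). A minimal sketch of that conversion, using the estimate and interval printed above (the variable names `beta` and `pct_change` are illustrative):

```r
# Coefficient and 95% CI for team_size from model_downloads (copied from the output above)
beta <- c(estimate = 0.021243, conf.low = 0.01675265, conf.high = 0.02573381)

# In a log-linear model, a one-unit increase in team_size multiplies the outcome by exp(beta),
# so the implied percentage change is 100 * (exp(beta) - 1)
pct_change <- 100 * (exp(beta) - 1)
print(round(pct_change, 2))
```

Under this model, one extra team member is associated with roughly a 2% increase in normalized downloads.
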
Code
summary(model_revdep)

Call:
lm(formula = log(reverse_dep_normalized) ~ team_size, data = user_commits_filtered1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8156 -0.7941 -0.2377  0.5859  4.6569 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.304352   0.052235 -24.971  < 2e-16 ***
team_size    0.006325   0.001223   5.171 3.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.092 on 503 degrees of freedom
Multiple R-squared:  0.05048,   Adjusted R-squared:  0.04859 
F-statistic: 26.74 on 1 and 503 DF,  p-value: 3.364e-07
Code
confint(model_revdep)
                   2.5 %       97.5 %
(Intercept) -1.406977950 -1.201726971
team_size    0.003922063  0.008728401
Code
# Tidy the models
tidy_stargazer <- tidy(model_stargazer, conf.int = TRUE)
tidy_fork <- tidy(model_fork, conf.int = TRUE)
tidy_downloads <- tidy(model_downloads, conf.int = TRUE)
tidy_revdep <- tidy(model_revdep, conf.int = TRUE)

# Combine the tidied data
tidy_combined <- bind_rows(
  tidy_stargazer %>% mutate(model = "Stargazer Count Normalized"),
  tidy_fork %>% mutate(model = "Fork Count Normalized"),
  tidy_downloads %>% mutate(model = "Log Downloads Normalized"),
  tidy_revdep %>% mutate(model = "Log Rev Dep Normalized")
)

# Filter out the intercept terms
tidy_combined <- tidy_combined %>% filter(term == "team_size")

# Determine y-axis limits so the confidence intervals and the zero reference line are visible
y_limits <- range(0, tidy_combined$conf.low, tidy_combined$conf.high)

# Create the plot
ggplot(tidy_combined, aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + # Zero line on the estimate axis (vertical after coord_flip)
  coord_flip() +
  scale_y_continuous(limits = y_limits) + # Adjust the limits based on confidence intervals
  labs(title = "Confidence Intervals for Team Size Coefficients",
       x = "Model",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Reduce title size
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    legend.position = "none"
  ) +
  scale_color_brewer(palette = "Set2")
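
The same fit, tidy, and combine sequence is repeated for each data set in the sections that follow. One way to reduce the duplication is a small helper that fits one model per outcome formula and returns the `team_size` estimate with its confidence interval. The function name `fit_team_size_models` and the simulated data are hypothetical, and the sketch uses base R (`coef()`, `confint()`) instead of `broom::tidy()` so that it runs without extra packages.

```r
# Hypothetical helper: fit lm() for each formula and collect the team_size estimate and 95% CI
fit_team_size_models <- function(data, formulas) {
  rows <- lapply(names(formulas), function(label) {
    fit <- lm(formulas[[label]], data = data)
    ci <- confint(fit)["team_size", ]
    data.frame(
      model = label,
      estimate = unname(coef(fit)["team_size"]),
      conf.low = unname(ci[1]),
      conf.high = unname(ci[2])
    )
  })
  do.call(rbind, rows)
}

# Simulated data for illustration only (not the CRAN data)
set.seed(42)
demo <- data.frame(team_size = rpois(300, 4) + 1)
demo$stargazer_count <- 5 * demo$team_size + rnorm(300, sd = 10)
demo$Downloads_All_Time <- exp(8 + 0.1 * demo$team_size + rnorm(300))

formulas <- list(
  "Stargazer Count" = stargazer_count ~ team_size,
  "Log Downloads"   = log(Downloads_All_Time) ~ team_size
)

team_size_effects <- fit_team_size_models(demo, formulas)
print(team_size_effects)
```

The resulting data frame has the same `model`, `estimate`, `conf.low`, and `conf.high` columns that the plotting code expects, so it could be passed to the same `ggplot()` call for any of the filtered data sets.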

3.4.7.2 non-normalized

Code
# Filter out non-positive values and missing values
user_commits_filtered2 <- user_commits_distinct %>%
  filter(stargazer_count > 0, !is.na(stargazer_count),
         fork_count > 0, !is.na(fork_count),
         Downloads_All_Time > 0, !is.na(Downloads_All_Time),
         Reverse_Depends_Count > 0, !is.na(Reverse_Depends_Count))

# Perform linear regression and get summary and confidence intervals
model_stargazer <- lm(stargazer_count ~ team_size, data = user_commits_filtered2)
model_fork <- lm(fork_count ~ team_size, data = user_commits_filtered2)
model_downloads <- lm(log(Downloads_All_Time) ~ team_size, data = user_commits_filtered2)
model_revdep <- lm(log(Reverse_Depends_Count) ~ team_size, data = user_commits_filtered2)

summary(model_stargazer)

Call:
lm(formula = stargazer_count ~ team_size, data = user_commits_filtered2)

Residuals:
    Min      1Q  Median      3Q     Max 
-7359.0   -26.5    56.7    98.6 10930.6 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -119.8956    33.6007  -3.568 0.000394 ***
team_size     24.3104     0.7868  30.897  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 702.5 on 503 degrees of freedom
Multiple R-squared:  0.6549,    Adjusted R-squared:  0.6542 
F-statistic: 954.6 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_stargazer)
                 2.5 %    97.5 %
(Intercept) -185.91056 -53.88072
team_size     22.76449  25.85622
Code
summary(model_fork)

Call:
lm(formula = fork_count ~ team_size, data = user_commits_filtered2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2631.8    -9.0    33.7    46.1  3907.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -58.4527    12.1551  -4.809 2.01e-06 ***
team_size     8.5845     0.2846  30.160  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 254.1 on 503 degrees of freedom
Multiple R-squared:  0.6439,    Adjusted R-squared:  0.6432 
F-statistic: 909.6 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_fork)
                 2.5 %     97.5 %
(Intercept) -82.333718 -34.571606
team_size     8.025327   9.143766
Code
summary(model_downloads)

Call:
lm(formula = log(Downloads_All_Time) ~ team_size, data = user_commits_filtered2)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3728 -1.6455 -0.0803  1.5539  4.9093 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.920120   0.105600 122.350   <2e-16 ***
team_size    0.021367   0.002473   8.641   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.208 on 503 degrees of freedom
Multiple R-squared:  0.1293,    Adjusted R-squared:  0.1275 
F-statistic: 74.66 on 1 and 503 DF,  p-value: < 2.2e-16
Code
confint(model_downloads)
                  2.5 %      97.5 %
(Intercept) 12.71264851 13.12759066
team_size    0.01650887  0.02622552
Code
summary(model_revdep)

Call:
lm(formula = log(Reverse_Depends_Count) ~ team_size, data = user_commits_filtered2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5184 -0.7519 -0.2850  0.5455  4.6885 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.727682   0.051652   14.09  < 2e-16 ***
team_size   0.008079   0.001210    6.68 6.37e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.08 on 503 degrees of freedom
Multiple R-squared:  0.08147,   Adjusted R-squared:  0.07965 
F-statistic: 44.62 on 1 and 503 DF,  p-value: 6.366e-11
Code
confint(model_revdep)
                  2.5 %     97.5 %
(Intercept) 0.626202254 0.82916095
team_size   0.005702663 0.01045532
Code
# Tidy the models
tidy_stargazer <- tidy(model_stargazer, conf.int = TRUE)
tidy_fork <- tidy(model_fork, conf.int = TRUE)
tidy_downloads <- tidy(model_downloads, conf.int = TRUE)
tidy_revdep <- tidy(model_revdep, conf.int = TRUE)

# Combine the tidied data
tidy_combined <- bind_rows(
  tidy_stargazer %>% mutate(model = "Stargazer Count"),
  tidy_fork %>% mutate(model = "Fork Count"),
  tidy_downloads %>% mutate(model = "Downloads"),
  tidy_revdep %>% mutate(model = "Rev Dep")
)

# Filter out the intercept terms
tidy_combined <- tidy_combined %>% filter(term == "team_size")

# Determine y-axis limits so the confidence intervals and the zero reference line are visible
y_limits <- range(0, tidy_combined$conf.low, tidy_combined$conf.high)

# Create the plot
ggplot(tidy_combined, aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + # Zero line on the estimate axis (vertical after coord_flip)
  coord_flip() +
  scale_y_continuous(limits = y_limits) + # Adjust the limits based on confidence intervals
  labs(title = "Confidence Intervals for Team Size Coefficients",
       x = "Model",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Reduce title size
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    legend.position = "none"
  ) +
  scale_color_brewer(palette = "Set2")

3.4.8 models (continuous team_size without outliers z-score)

3.4.8.1 normalized

Code
# Filter out non-positive values and missing values
user_commits_filtered3 <- user_commits_no_outliers_z %>%
  filter(stargazer_count_normalized > 0, !is.na(stargazer_count_normalized),
         fork_count_normalized > 0, !is.na(fork_count_normalized),
         Downloads_Normalized > 0, !is.na(Downloads_Normalized),
         reverse_dep_normalized > 0, !is.na(reverse_dep_normalized))

# Perform linear regression and get summary and confidence intervals
model_stargazer <- lm(stargazer_count_normalized ~ team_size, data = user_commits_filtered3)
model_fork <- lm(fork_count_normalized ~ team_size, data = user_commits_filtered3)
model_downloads <- lm(log(Downloads_Normalized) ~ team_size, data = user_commits_filtered3)
model_revdep <- lm(log(reverse_dep_normalized) ~ team_size, data = user_commits_filtered3)

summary(model_stargazer)

Call:
lm(formula = stargazer_count_normalized ~ team_size, data = user_commits_filtered3)

Residuals:
    Min      1Q  Median      3Q     Max 
-67.293  -7.024  -2.675   0.630 290.788 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.9434     1.6179  -0.583     0.56    
team_size     2.0315     0.1269  16.005   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.34 on 475 degrees of freedom
Multiple R-squared:  0.3503,    Adjusted R-squared:  0.349 
F-statistic: 256.1 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_stargazer)
                2.5 %   97.5 %
(Intercept) -4.122506 2.235645
team_size    1.782113 2.280959
Code
summary(model_fork)

Call:
lm(formula = fork_count_normalized ~ team_size, data = user_commits_filtered3)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.757  -1.314  -0.325   0.376  35.070 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.30624    0.26143  -1.171    0.242    
team_size    0.44082    0.02051  21.492   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.257 on 475 degrees of freedom
Multiple R-squared:  0.493, Adjusted R-squared:  0.4919 
F-statistic: 461.9 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_fork)
                 2.5 %    97.5 %
(Intercept) -0.8199446 0.2074625
team_size    0.4005189 0.4811271
Code
summary(model_downloads)

Call:
lm(formula = log(Downloads_Normalized) ~ team_size, data = user_commits_filtered3)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9299 -1.4529 -0.1037  1.2123  5.0561 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.93952    0.10758   92.39   <2e-16 ***
team_size    0.11191    0.00844   13.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.752 on 475 degrees of freedom
Multiple R-squared:  0.2701,    Adjusted R-squared:  0.2686 
F-statistic: 175.8 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_downloads)
                2.5 %     97.5 %
(Intercept) 9.7281316 10.1509103
team_size   0.0953228  0.1284931
Code
summary(model_revdep)

Call:
lm(formula = log(reverse_dep_normalized) ~ team_size, data = user_commits_filtered3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8315 -0.7179 -0.2284  0.5128  4.6767 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.505822   0.062169 -24.221  < 2e-16 ***
team_size    0.026514   0.004878   5.436 8.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.012 on 475 degrees of freedom
Multiple R-squared:  0.05856,   Adjusted R-squared:  0.05658 
F-statistic: 29.55 on 1 and 475 DF,  p-value: 8.734e-08
Code
confint(model_revdep)
                  2.5 %      97.5 %
(Intercept) -1.62798240 -1.38366168
team_size    0.01692987  0.03609875
Code
# Tidy the models
tidy_stargazer <- tidy(model_stargazer, conf.int = TRUE)
tidy_fork <- tidy(model_fork, conf.int = TRUE)
tidy_downloads <- tidy(model_downloads, conf.int = TRUE)
tidy_revdep <- tidy(model_revdep, conf.int = TRUE)

# Combine the tidied data
tidy_combined <- bind_rows(
  tidy_stargazer %>% mutate(model = "Stargazer Count Normalized"),
  tidy_fork %>% mutate(model = "Fork Count Normalized"),
  tidy_downloads %>% mutate(model = "Log Downloads Normalized"),
  tidy_revdep %>% mutate(model = "Log Rev Dep Normalized")
)

# Filter out the intercept terms
tidy_combined <- tidy_combined %>% filter(term == "team_size")

# Determine y-axis limits so the confidence intervals and the zero reference line are visible
y_limits <- range(0, tidy_combined$conf.low, tidy_combined$conf.high)

# Create the plot
ggplot(tidy_combined, aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + # Zero line on the estimate axis (vertical after coord_flip)
  coord_flip() +
  scale_y_continuous(limits = y_limits) + # Adjust the limits based on confidence intervals
  labs(title = "Confidence Intervals for Team Size Coefficients",
       x = "Model",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Reduce title size
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    legend.position = "none"
  ) +
  scale_color_brewer(palette = "Set2")

3.4.8.2 non-normalized

Code
# Filter out non-positive values and missing values
user_commits_filtered4 <- user_commits_no_outliers_z %>%
  filter(stargazer_count > 0, !is.na(stargazer_count),
         fork_count > 0, !is.na(fork_count),
         Downloads_All_Time > 0, !is.na(Downloads_All_Time),
         Reverse_Depends_Count > 0, !is.na(Reverse_Depends_Count))

# Perform linear regression and get summary and confidence intervals
model_stargazer <- lm(stargazer_count ~ team_size, data = user_commits_filtered4)
model_fork <- lm(fork_count ~ team_size, data = user_commits_filtered4)
model_downloads <- lm(log(Downloads_All_Time) ~ team_size, data = user_commits_filtered4)
model_revdep <- lm(log(Reverse_Depends_Count) ~ team_size, data = user_commits_filtered4)

summary(model_stargazer)

Call:
lm(formula = stargazer_count ~ team_size, data = user_commits_filtered4)

Residuals:
    Min      1Q  Median      3Q     Max 
-705.46  -57.03  -15.65    9.58 1986.20 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -20.870     13.223  -1.578    0.115    
team_size     18.762      1.037  18.085   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 215.3 on 475 degrees of freedom
Multiple R-squared:  0.4078,    Adjusted R-squared:  0.4065 
F-statistic: 327.1 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_stargazer)
                2.5 %    97.5 %
(Intercept) -46.85223  5.113025
team_size    16.72326 20.800338
Code
summary(model_fork)

Call:
lm(formula = fork_count ~ team_size, data = user_commits_filtered4)

Residuals:
    Min      1Q  Median      3Q     Max 
-132.02  -11.93   -2.18    4.07  419.98 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.1981     2.4687  -2.106   0.0358 *  
team_size     4.1882     0.1937  21.624   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.2 on 475 degrees of freedom
Multiple R-squared:  0.4961,    Adjusted R-squared:  0.495 
F-statistic: 467.6 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_fork)
                 2.5 %    97.5 %
(Intercept) -10.048919 -0.347224
team_size     3.807587  4.568761
Code
summary(model_downloads)

Call:
lm(formula = log(Downloads_All_Time) ~ team_size, data = user_commits_filtered4)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8558 -1.4676 -0.0868  1.3518  4.7035 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.082794   0.119156  101.40   <2e-16 ***
team_size    0.112912   0.009349   12.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.94 on 475 degrees of freedom
Multiple R-squared:  0.2349,    Adjusted R-squared:  0.2333 
F-statistic: 145.9 on 1 and 475 DF,  p-value: < 2.2e-16
Code
confint(model_downloads)
                  2.5 %     97.5 %
(Intercept) 11.84865489 12.3169324
team_size    0.09454155  0.1312816
Code
summary(model_revdep)

Call:
lm(formula = log(Reverse_Depends_Count) ~ team_size, data = user_commits_filtered4)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0066 -0.6009 -0.4928  0.4789  4.7078 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.456771   0.058992   7.743 5.91e-14 ***
team_size   0.036043   0.004628   7.787 4.33e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9605 on 475 degrees of freedom
Multiple R-squared:  0.1132,    Adjusted R-squared:  0.1114 
F-statistic: 60.64 on 1 and 475 DF,  p-value: 4.326e-14
Code
confint(model_revdep)
                 2.5 %    97.5 %
(Intercept) 0.34085461 0.5726880
team_size   0.02694846 0.0451376
Code
# Tidy the models
tidy_stargazer <- tidy(model_stargazer, conf.int = TRUE)
tidy_fork <- tidy(model_fork, conf.int = TRUE)
tidy_downloads <- tidy(model_downloads, conf.int = TRUE)
tidy_revdep <- tidy(model_revdep, conf.int = TRUE)

# Combine the tidied data
tidy_combined <- bind_rows(
  tidy_stargazer %>% mutate(model = "Stargazer Count"),
  tidy_fork %>% mutate(model = "Fork Count"),
  tidy_downloads %>% mutate(model = "Downloads"),
  tidy_revdep %>% mutate(model = "Rev Dep")
)

# Filter out the intercept terms
tidy_combined <- tidy_combined %>% filter(term == "team_size")

# Determine y-axis limits so the confidence intervals and the zero reference line are visible
y_limits <- range(0, tidy_combined$conf.low, tidy_combined$conf.high)

# Create the plot
ggplot(tidy_combined, aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") + # Zero line on the estimate axis (vertical after coord_flip)
  coord_flip() +
  scale_y_continuous(limits = y_limits) + # Adjust the limits based on confidence intervals
  labs(title = "Confidence Intervals for Team Size Coefficients",
       x = "Model",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Reduce title size
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    legend.position = "none"
  ) +
  scale_color_brewer(palette = "Set2")

3.4.9 models (team_size_bin with outliers)

3.4.9.1 normalized

Code
user_commits_filtered1$team_size_bin <- as.factor(user_commits_filtered1$team_size_bin)

# Perform regression analysis using team_size_bin
model_stargazer_bin <- lm(stargazer_count_normalized ~ team_size_bin, data = user_commits_filtered1)
model_fork_bin <- lm(fork_count_normalized ~ team_size_bin, data = user_commits_filtered1)
model_downloads_bin <- lm(Downloads_Normalized ~ team_size_bin, data = user_commits_filtered1)
model_revdep_bin <- lm(reverse_dep_normalized ~ team_size_bin, data = user_commits_filtered1)

# Tidy the models
tidy_stargazer_bin <- tidy(model_stargazer_bin, conf.int = TRUE)
tidy_fork_bin <- tidy(model_fork_bin, conf.int = TRUE)
tidy_downloads_bin <- tidy(model_downloads_bin, conf.int = TRUE)
tidy_revdep_bin <- tidy(model_revdep_bin, conf.int = TRUE)

# Combine the tidied data
tidy_combined_bin <- bind_rows(
  tidy_stargazer_bin %>% mutate(model = "Stargazer Count Normalized"),
  tidy_fork_bin %>% mutate(model = "Fork Count Normalized"),
  tidy_downloads_bin %>% mutate(model = "Downloads Normalized"),
  tidy_revdep_bin %>% mutate(model = "Rev Dep Normalized")
)

# Filter out the intercept terms
tidy_combined_bin <- tidy_combined_bin %>% filter(term != "(Intercept)")

# Create the plot
ggplot(tidy_combined_bin, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  coord_flip() +
  facet_wrap(~ model, scales = "free_y") +
  labs(title = "Regression Coefficients for Team Size Bins",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "bottom",
    legend.title = element_blank()
  ) +
  scale_color_brewer(palette = "Set1")
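
With `team_size_bin` entered as a factor, `lm()` uses treatment contrasts by default: the intercept is the mean of the reference bin (the first factor level) and every other coefficient is the difference between that bin's mean and the reference mean. The coefficient plots therefore show differences from the baseline bin, not the bin means themselves. A small sketch on simulated data (the bin labels and values are made up) illustrates the equivalence:

```r
# Simulated data for illustration only
set.seed(7)
demo <- data.frame(
  team_size_bin = factor(rep(c("[1]", "[2]", "[3-4]"), each = 50)),
  stars = c(rnorm(50, mean = 10), rnorm(50, mean = 12), rnorm(50, mean = 20))
)

fit <- lm(stars ~ team_size_bin, data = demo)
bin_means <- tapply(demo$stars, demo$team_size_bin, mean)

# Intercept = mean of the reference bin; other coefficients = differences from it
coef_from_means <- c(bin_means[["[1]"]],
                     bin_means[["[2]"]] - bin_means[["[1]"]],
                     bin_means[["[3-4]"]] - bin_means[["[1]"]])

print(cbind(lm_coefficient = unname(coef(fit)), from_group_means = coef_from_means))
```

The two columns agree up to floating-point error, which is why a large coefficient for the top bin should be read as "much higher than the single-contributor bin" and not as that bin's average.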

Code
ggplot(tidy_stargazer_bin %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(size = 1.2, color = "darkblue") +
  geom_point(size = 4, shape = 21, fill = "blue") +
  coord_flip() +
  labs(title = "Regression Coefficients for Team Size Bins (Stargazer Count Normalized)",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    panel.grid.major = element_line(color = "gray", linewidth = 0.5),
    panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25),
    panel.background = element_rect(fill = "white", color = "black")
  )

Code
ggplot(tidy_fork_bin %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(size = 1.2, color = "darkgreen") +
  geom_point(size = 4, shape = 21, fill = "green") +
  coord_flip() +
  labs(title = "Regression Coefficients for Team Size Bins (Fork Count Normalized)",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    panel.grid.major = element_line(color = "gray", linewidth = 0.5),
    panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25),
    panel.background = element_rect(fill = "white", color = "black")
  )

Code
ggplot(tidy_downloads_bin %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(size = 1.2, color = "darkred") +
  geom_point(size = 4, shape = 21, fill = "red") +
  coord_flip() +
  labs(title = "Regression Coefficients for Team Size Bins (Downloads Normalized)",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    panel.grid.major = element_line(color = "gray", linewidth = 0.5),
    panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25),
    panel.background = element_rect(fill = "white", color = "black")
  )

Code
ggplot(tidy_revdep_bin %>% filter(term != "(Intercept)"), 
       aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(size = 1.2, color = "darkred") +
  geom_point(size = 4, shape = 21, fill = "red") +
  coord_flip() +
  labs(title = "Regression Coefficients for Team Size Bins (Rev Dep Normalized)",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_bw() +
  theme(
    plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold"),
    axis.title.y = element_text(size = 14, face = "bold"),
    axis.text = element_text(size = 12),
    panel.grid.major = element_line(color = "gray", linewidth = 0.5),
    panel.grid.minor = element_line(color = "lightgray", linewidth = 0.25),
    panel.background = element_rect(fill = "white", color = "black")
  )

Code
library(multcomp)

# Perform Tukey's HSD test
tukey_test <- glht(model_stargazer_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = stargazer_count_normalized ~ team_size_bin, data = user_commits_filtered1)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0          2.077     21.749   0.095  0.99998    
[3-4] - [1] == 0        3.696     20.232   0.183  0.99974    
[5-7] - [1] == 0        7.903     21.749   0.363  0.99613    
[8-880] - [1] == 0     62.245     19.093   3.260  0.01008 *  
[3-4] - [2] == 0        1.620     17.061   0.095  0.99998    
[5-7] - [2] == 0        5.826     18.835   0.309  0.99793    
[8-880] - [2] == 0     60.168     15.694   3.834  0.00133 ** 
[5-7] - [3-4] == 0      4.206     17.061   0.247  0.99915    
[8-880] - [3-4] == 0   58.548     13.514   4.332  < 0.001 ***
[8-880] - [5-7] == 0   54.342     15.694   3.463  0.00499 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (fork_count_normalized)
tukey_test <- glht(model_fork_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = fork_count_normalized ~ team_size_bin, data = user_commits_filtered1)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)   
[2] - [1] == 0         0.2323     7.6877   0.030  1.00000   
[3-4] - [1] == 0       0.7922     7.1516   0.111  0.99996   
[5-7] - [1] == 0       1.8214     7.6877   0.237  0.99927   
[8-880] - [1] == 0    17.6007     6.7491   2.608  0.06775 . 
[3-4] - [2] == 0       0.5599     6.0308   0.093  0.99998   
[5-7] - [2] == 0       1.5891     6.6578   0.239  0.99925   
[8-880] - [2] == 0    17.3684     5.5476   3.131  0.01517 * 
[5-7] - [3-4] == 0     1.0292     6.0308   0.171  0.99980   
[8-880] - [3-4] == 0  16.8085     4.7770   3.519  0.00411 **
[8-880] - [5-7] == 0  15.7793     5.5476   2.844  0.03582 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (reverse_dep_normalized)
tukey_test <- glht(model_revdep_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = reverse_dep_normalized ~ team_size_bin, data = user_commits_filtered1)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)  
[2] - [1] == 0       -0.18017    0.53574  -0.336   0.9971  
[3-4] - [1] == 0     -0.05556    0.49838  -0.111   1.0000  
[5-7] - [1] == 0      0.77634    0.53574   1.449   0.5881  
[8-880] - [1] == 0    1.00912    0.47033   2.146   0.1968  
[3-4] - [2] == 0      0.12462    0.42027   0.297   0.9982  
[5-7] - [2] == 0      0.95651    0.46396   2.062   0.2320  
[8-880] - [2] == 0    1.18929    0.38660   3.076   0.0181 *
[5-7] - [3-4] == 0    0.83189    0.42027   1.979   0.2705  
[8-880] - [3-4] == 0  1.06468    0.33290   3.198   0.0122 *
[8-880] - [5-7] == 0  0.23278    0.38660   0.602   0.9738  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Downloads_Normalized)
tukey_test <- glht(model_downloads_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Downloads_Normalized ~ team_size_bin, data = user_commits_filtered1)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0          16770     310663   0.054    1.000    
[3-4] - [1] == 0       -19646     288997  -0.068    1.000    
[5-7] - [1] == 0       110083     310663   0.354    0.996    
[8-880] - [1] == 0    1501247     272734   5.504   <1e-05 ***
[3-4] - [2] == 0       -36416     243704  -0.149    1.000    
[5-7] - [2] == 0        93313     269042   0.347    0.997    
[8-880] - [2] == 0    1484477     224178   6.622   <1e-05 ***
[5-7] - [3-4] == 0     129729     243704   0.532    0.983    
[8-880] - [3-4] == 0  1520893     193039   7.879   <1e-05 ***
[8-880] - [5-7] == 0  1391164     224178   6.206   <1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

3.4.9.2 non-normalized

Code
user_commits_filtered2$team_size_bin <- as.factor(user_commits_filtered2$team_size_bin)

# Perform regression analysis using team_size_bin
model_stargazer_bin <- lm(stargazer_count ~ team_size_bin, data = user_commits_filtered2)
model_fork_bin <- lm(fork_count ~ team_size_bin, data = user_commits_filtered2)
model_downloads_bin <- lm(Downloads_All_Time ~ team_size_bin, data = user_commits_filtered2)
model_revdep_bin <- lm(Reverse_Depends_Count ~ team_size_bin, data = user_commits_filtered2)

# Tidy the models
tidy_stargazer_bin <- tidy(model_stargazer_bin, conf.int = TRUE)
tidy_fork_bin <- tidy(model_fork_bin, conf.int = TRUE)
tidy_downloads_bin <- tidy(model_downloads_bin, conf.int = TRUE)
tidy_revdep_bin <- tidy(model_revdep_bin, conf.int = TRUE)

# Combine the tidied data
tidy_combined_bin <- bind_rows(
  tidy_stargazer_bin %>% mutate(model = "Stargazer Count"),
  tidy_fork_bin %>% mutate(model = "Fork Count"),
  tidy_downloads_bin %>% mutate(model = "Downloads"),
  tidy_revdep_bin %>% mutate(model = "Rev Dep")
)

# Filter out the intercept terms
tidy_combined_bin <- tidy_combined_bin %>% filter(term != "(Intercept)")

# Create the plot
ggplot(tidy_combined_bin, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  coord_flip() +
  facet_wrap(~ model, scales = "free_y") +
  labs(title = "Regression Coefficients for Team Size Bins",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "bottom",
    legend.title = element_blank()
  ) +
  scale_color_brewer(palette = "Set1")

Code
# Tukey all-pairwise contrasts between team-size bins (stargazer_count)
tukey_test <- glht(model_stargazer_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = stargazer_count ~ team_size_bin, data = user_commits_filtered2)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0          17.24     219.92   0.078  0.99999    
[3-4] - [1] == 0        29.94     204.58   0.146  0.99989    
[5-7] - [1] == 0        64.85     219.92   0.295  0.99828    
[8-880] - [1] == 0     603.08     193.07   3.124  0.01544 *  
[3-4] - [2] == 0        12.71     172.52   0.074  0.99999    
[5-7] - [2] == 0        47.61     190.46   0.250  0.99910    
[8-880] - [2] == 0     585.85     158.70   3.692  0.00221 ** 
[5-7] - [3-4] == 0      34.91     172.52   0.202  0.99961    
[8-880] - [3-4] == 0   573.14     136.65   4.194  < 0.001 ***
[8-880] - [5-7] == 0   538.23     158.70   3.392  0.00642 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (fork_count)
tukey_test <- glht(model_fork_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = fork_count ~ team_size_bin, data = user_commits_filtered2)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)   
[2] - [1] == 0          2.280     78.960   0.029  1.00000   
[3-4] - [1] == 0        7.289     73.453   0.099  0.99998   
[5-7] - [1] == 0       16.253     78.960   0.206  0.99958   
[8-880] - [1] == 0    179.502     69.320   2.589  0.07113 . 
[3-4] - [2] == 0        5.009     61.941   0.081  0.99999   
[5-7] - [2] == 0       13.973     68.381   0.204  0.99960   
[8-880] - [2] == 0    177.222     56.979   3.110  0.01614 * 
[5-7] - [3-4] == 0      8.964     61.941   0.145  0.99990   
[8-880] - [3-4] == 0  172.213     49.064   3.510  0.00428 **
[8-880] - [5-7] == 0  163.249     56.979   2.865  0.03379 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Reverse_Depends_Count)
tukey_test <- glht(model_revdep_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Reverse_Depends_Count ~ team_size_bin, data = user_commits_filtered2)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)   
[2] - [1] == 0        -0.4311     5.5480  -0.078  0.99999   
[3-4] - [1] == 0       0.4615     5.1611   0.089  0.99998   
[5-7] - [1] == 0       4.2622     5.5480   0.768  0.93756   
[8-880] - [1] == 0    12.5728     4.8707   2.581  0.07263 . 
[3-4] - [2] == 0       0.8926     4.3522   0.205  0.99959   
[5-7] - [2] == 0       4.6933     4.8047   0.977  0.86161   
[8-880] - [2] == 0    13.0039     4.0035   3.248  0.01043 * 
[5-7] - [3-4] == 0     3.8007     4.3522   0.873  0.90365   
[8-880] - [3-4] == 0  12.1113     3.4474   3.513  0.00417 **
[8-880] - [5-7] == 0   8.3106     4.0035   2.076  0.22580   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Downloads_All_Time)
tukey_test <- glht(model_downloads_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Downloads_All_Time ~ team_size_bin, data = user_commits_filtered2)

Linear Hypotheses:
                     Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0          98092    2685882   0.037    1.000    
[3-4] - [1] == 0      -201122    2498570  -0.080    1.000    
[5-7] - [1] == 0      1193374    2685882   0.444    0.992    
[8-880] - [1] == 0   13768280    2357961   5.839   <1e-05 ***
[3-4] - [2] == 0      -299214    2106979  -0.142    1.000    
[5-7] - [2] == 0      1095282    2326042   0.471    0.990    
[8-880] - [2] == 0   13670188    1938167   7.053   <1e-05 ***
[5-7] - [3-4] == 0    1394496    2106979   0.662    0.963    
[8-880] - [3-4] == 0 13969402    1668946   8.370   <1e-05 ***
[8-880] - [5-7] == 0 12574906    1938167   6.488   <1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

3.4.10 models (team_size_bin, outliers removed by z-score)

3.4.10.1 normalized

Code
user_commits_filtered3$team_size_bin <- as.factor(user_commits_filtered3$team_size_bin)

# Perform regression analysis using team_size_bin
model_stargazer_bin <- lm(stargazer_count_normalized ~ team_size_bin, data = user_commits_filtered3)
model_fork_bin <- lm(fork_count_normalized ~ team_size_bin, data = user_commits_filtered3)
model_downloads_bin <- lm(Downloads_Normalized ~ team_size_bin, data = user_commits_filtered3)
model_revdep_bin <- lm(reverse_dep_normalized ~ team_size_bin, data = user_commits_filtered3)

# Tidy the models
tidy_stargazer_bin <- tidy(model_stargazer_bin, conf.int = TRUE)
tidy_fork_bin <- tidy(model_fork_bin, conf.int = TRUE)
tidy_downloads_bin <- tidy(model_downloads_bin, conf.int = TRUE)
tidy_revdep_bin <- tidy(model_revdep_bin, conf.int = TRUE)

# Combine the tidied data
tidy_combined_bin <- bind_rows(
  tidy_stargazer_bin %>% mutate(model = "Stargazer Count Normalized"),
  tidy_fork_bin %>% mutate(model = "Fork Count Normalized"),
  tidy_downloads_bin %>% mutate(model = "Downloads Normalized"),
  tidy_revdep_bin %>% mutate(model = "Rev Dep Normalized")
)

# Filter out the intercept terms
tidy_combined_bin <- tidy_combined_bin %>% filter(term != "(Intercept)")

# Create the plot
ggplot(tidy_combined_bin, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  coord_flip() +
  facet_wrap(~ model, scales = "free_y") +
  labs(title = "Regression Coefficients for Team Size Bins",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "bottom",
    legend.title = element_blank()
  ) +
  scale_color_brewer(palette = "Set1")

Code
# Tukey all-pairwise contrasts between team-size bins (stargazer_count_normalized, outliers removed)
tukey_test <- glht(model_stargazer_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = stargazer_count_normalized ~ team_size_bin, data = user_commits_filtered3)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0         2.077      5.494   0.378    0.996    
[3-4] - [1] == 0       3.696      5.111   0.723    0.950    
[5-6] - [1] == 0       7.903      5.494   1.438    0.596    
[7-58] - [1] == 0     34.909      4.900   7.124   <1e-04 ***
[3-4] - [2] == 0       1.620      4.310   0.376    0.996    
[5-6] - [2] == 0       5.826      4.758   1.225    0.732    
[7-58] - [2] == 0     32.833      4.058   8.092   <1e-04 ***
[5-6] - [3-4] == 0     4.206      4.310   0.976    0.863    
[7-58] - [3-4] == 0   31.213      3.521   8.864   <1e-04 ***
[7-58] - [5-6] == 0   27.006      4.058   6.656   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (fork_count_normalized, outliers removed)
tukey_test <- glht(model_fork_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = fork_count_normalized ~ team_size_bin, data = user_commits_filtered3)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0        0.2323     0.9737   0.239    0.999    
[3-4] - [1] == 0      0.7922     0.9058   0.875    0.904    
[5-6] - [1] == 0      1.8214     0.9737   1.871    0.328    
[7-58] - [1] == 0     7.0720     0.8685   8.143   <1e-04 ***
[3-4] - [2] == 0      0.5599     0.7639   0.733    0.947    
[5-6] - [2] == 0      1.5891     0.8433   1.884    0.321    
[7-58] - [2] == 0     6.8397     0.7192   9.511   <1e-04 ***
[5-6] - [3-4] == 0    1.0292     0.7639   1.347    0.655    
[7-58] - [3-4] == 0   6.2798     0.6241  10.062   <1e-04 ***
[7-58] - [5-6] == 0   5.2506     0.7192   7.301   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (reverse_dep_normalized, outliers removed)
tukey_test <- glht(model_revdep_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = reverse_dep_normalized ~ team_size_bin, data = user_commits_filtered3)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)  
[2] - [1] == 0      -0.18017    0.45490  -0.396   0.9946  
[3-4] - [1] == 0    -0.05556    0.42317  -0.131   0.9999  
[5-6] - [1] == 0     0.77634    0.45490   1.707   0.4239  
[7-58] - [1] == 0    0.65605    0.40571   1.617   0.4801  
[3-4] - [2] == 0     0.12462    0.35685   0.349   0.9967  
[5-6] - [2] == 0     0.95651    0.39395   2.428   0.1066  
[7-58] - [2] == 0    0.83623    0.33596   2.489   0.0922 .
[5-6] - [3-4] == 0   0.83189    0.35685   2.331   0.1331  
[7-58] - [3-4] == 0  0.71161    0.29157   2.441   0.1033  
[7-58] - [5-6] == 0 -0.12028    0.33596  -0.358   0.9964  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Downloads_Normalized, outliers removed)
tukey_test <- glht(model_downloads_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Downloads_Normalized ~ team_size_bin, data = user_commits_filtered3)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0         16770     245757   0.068    1.000    
[3-4] - [1] == 0      -19646     228618  -0.086    1.000    
[5-6] - [1] == 0      110083     245757   0.448    0.991    
[7-58] - [1] == 0    1035129     219187   4.723 3.01e-05 ***
[3-4] - [2] == 0      -36416     192788  -0.189    1.000    
[5-6] - [2] == 0       93313     212832   0.438    0.992    
[7-58] - [2] == 0    1018359     181504   5.611  < 1e-05 ***
[5-6] - [3-4] == 0    129729     192788   0.673    0.961    
[7-58] - [3-4] == 0  1054775     157522   6.696  < 1e-05 ***
[7-58] - [5-6] == 0   925047     181504   5.097  < 1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

3.4.10.2 non-normalized

Code
user_commits_filtered4$team_size_bin <- as.factor(user_commits_filtered4$team_size_bin)

# Perform regression analysis using team_size_bin
model_stargazer_bin <- lm(stargazer_count ~ team_size_bin, data = user_commits_filtered4)
model_fork_bin <- lm(fork_count ~ team_size_bin, data = user_commits_filtered4)
model_downloads_bin <- lm(Downloads_All_Time ~ team_size_bin, data = user_commits_filtered4)
model_revdep_bin <- lm(Reverse_Depends_Count ~ team_size_bin, data = user_commits_filtered4)

# Tidy the models
tidy_stargazer_bin <- tidy(model_stargazer_bin, conf.int = TRUE)
tidy_fork_bin <- tidy(model_fork_bin, conf.int = TRUE)
tidy_downloads_bin <- tidy(model_downloads_bin, conf.int = TRUE)
tidy_revdep_bin <- tidy(model_revdep_bin, conf.int = TRUE)

# Combine the tidied data
tidy_combined_bin <- bind_rows(
  tidy_stargazer_bin %>% mutate(model = "Stargazer Count"),
  tidy_fork_bin %>% mutate(model = "Fork Count"),
  tidy_downloads_bin %>% mutate(model = "Downloads"),
  tidy_revdep_bin %>% mutate(model = "Rev Dep")
)

# Filter out the intercept terms
tidy_combined_bin <- tidy_combined_bin %>% filter(term != "(Intercept)")

# Create the plot
ggplot(tidy_combined_bin, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = model)) +
  geom_pointrange(size = 1.2) +
  geom_point(size = 3) +
  coord_flip() +
  facet_wrap(~ model, scales = "free_y") +
  labs(title = "Regression Coefficients for Team Size Bins",
       x = "Team Size Bin",
       y = "Coefficient Estimate") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title.x = element_text(size = 12, face = "bold"),
    axis.title.y = element_text(size = 12, face = "bold"),
    axis.text = element_text(size = 10),
    strip.text = element_text(size = 12, face = "bold"),
    legend.position = "bottom",
    legend.title = element_blank()
  ) +
  scale_color_brewer(palette = "Set1")

Code
# Tukey all-pairwise contrasts between team-size bins (stargazer_count, outliers removed)
tukey_test <- glht(model_stargazer_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = stargazer_count ~ team_size_bin, data = user_commits_filtered4)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0         17.24      46.46   0.371    0.996    
[3-4] - [1] == 0       29.94      43.22   0.693    0.957    
[5-6] - [1] == 0       64.85      46.46   1.396    0.624    
[7-58] - [1] == 0     309.56      41.44   7.470   <1e-04 ***
[3-4] - [2] == 0       12.71      36.45   0.349    0.997    
[5-6] - [2] == 0       47.61      40.24   1.183    0.756    
[7-58] - [2] == 0     292.32      34.32   8.519   <1e-04 ***
[5-6] - [3-4] == 0     34.91      36.45   0.958    0.871    
[7-58] - [3-4] == 0   279.61      29.78   9.389   <1e-04 ***
[7-58] - [5-6] == 0   244.71      34.32   7.131   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (fork_count, outliers removed)
tukey_test <- glht(model_fork_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = fork_count ~ team_size_bin, data = user_commits_filtered4)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0         2.280      9.277   0.246    0.999    
[3-4] - [1] == 0       7.289      8.630   0.845    0.914    
[5-6] - [1] == 0      16.253      9.277   1.752    0.396    
[7-58] - [1] == 0     65.752      8.274   7.947   <1e-04 ***
[3-4] - [2] == 0       5.009      7.278   0.688    0.958    
[5-6] - [2] == 0      13.973      8.034   1.739    0.404    
[7-58] - [2] == 0     63.472      6.852   9.264   <1e-04 ***
[5-6] - [3-4] == 0     8.964      7.278   1.232    0.727    
[7-58] - [3-4] == 0   58.463      5.946   9.832   <1e-04 ***
[7-58] - [5-6] == 0   49.498      6.852   7.224   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Reverse_Depends_Count, outliers removed)
tukey_test <- glht(model_revdep_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Reverse_Depends_Count ~ team_size_bin, data = user_commits_filtered4)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)  
[2] - [1] == 0       -0.4311     3.4936  -0.123   0.9999  
[3-4] - [1] == 0      0.4615     3.2499   0.142   0.9999  
[5-6] - [1] == 0      4.2622     3.4936   1.220   0.7344  
[7-58] - [1] == 0     7.6828     3.1158   2.466   0.0974 .
[3-4] - [2] == 0      0.8926     2.7406   0.326   0.9975  
[5-6] - [2] == 0      4.6933     3.0255   1.551   0.5226  
[7-58] - [2] == 0     8.1139     2.5802   3.145   0.0147 *
[5-6] - [3-4] == 0    3.8007     2.7406   1.387   0.6300  
[7-58] - [3-4] == 0   7.2213     2.2393   3.225   0.0115 *
[7-58] - [5-6] == 0   3.4206     2.5802   1.326   0.6692  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Code
# Tukey all-pairwise contrasts between team-size bins (Downloads_All_Time, outliers removed)
tukey_test <- glht(model_downloads_bin, linfct = mcp(team_size_bin = "Tukey"))

# Summary of the Tukey test results
summary(tukey_test)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Downloads_All_Time ~ team_size_bin, data = user_commits_filtered4)

Linear Hypotheses:
                    Estimate Std. Error t value Pr(>|t|)    
[2] - [1] == 0         98092    2054842   0.048    1.000    
[3-4] - [1] == 0     -201122    1911539  -0.105    1.000    
[5-6] - [1] == 0     1193374    2054842   0.581    0.977    
[7-58] - [1] == 0    9449024    1832678   5.156   <1e-05 ***
[3-4] - [2] == 0     -299214    1611951  -0.186    1.000    
[5-6] - [2] == 0     1095282    1779546   0.615    0.972    
[7-58] - [2] == 0    9350931    1517602   6.162   <1e-05 ***
[5-6] - [3-4] == 0   1394496    1611951   0.865    0.907    
[7-58] - [3-4] == 0  9650145    1317087   7.327   <1e-05 ***
[7-58] - [5-6] == 0  8255650    1517602   5.440   <1e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)